Abstract

We develop multi-step gradient methods for network-constrained optimization of strongly convex functions with 
Lipschitz-continuous gradients. Given the topology of the underlying network and bounds on the Hessian of the objective 
function, we determine the algorithm parameters that guarantee the fastest convergence and characterize situations when 
significant speed-ups can be obtained over the standard gradient method. Furthermore, we quantify how the performance 
of the gradient method and its accelerated counterpart are affected by uncertainty in the problem data, and conclude 
that in most cases our proposed method outperforms gradient descent. Finally, we apply the proposed technique to three 
engineering problems: resource allocation under network-wide budget constraints, distributed averaging, and Internet 
congestion control. In all cases, we demonstrate that our algorithm converges more rapidly than alternative algorithms 
reported in the literature. 

I. Introduction 

Distributed optimization has recently attracted significant attention from several research communities. Examples
include the work on network utility maximization for resource allocation in communication networks [1], distributed
coordination of multi-agent systems [2], collaborative estimation in wireless sensor networks [3], distributed machine
learning [4], and many others. The majority of these approaches apply gradient or subgradient methods to the dual
formulation of the decision problem. Although gradient methods are easy to implement and require modest computations,
they suffer from slow convergence. In some cases, such as the development of distributed power control algorithms for
cellular phones [5], one can replace gradient methods by fixed-point iterations and achieve improved convergence rates.
For other problems, such as average consensus [6], a number of heuristic methods have been proposed that improve the
convergence time of the standard method [7], [8]. However, we are not interested in tailoring techniques to individual
problems; our aim is to develop general-purpose schemes that retain the simplicity of the gradient method, yet improve
the convergence factors.

Even if the optimization problem is convex and the subgradient method is guaranteed to converge to an optimal
solution, the rate of convergence is very modest. The convergence rate of the gradient method is improved if the
objective function is differentiable with Lipschitz-continuous gradient, and even more so if the function is also strongly
convex. However, for smooth optimization problems several techniques allow for even better convergence rates. One
such technique is higher-order methods, such as Newton's method [9], which use both the gradient and the Hessian of
the objective function. Although distributed Newton methods have recently been developed for special problem classes
(e.g., [10], [11]), they impose a large communication overhead to collect global Hessian information. Another technique
is the augmented Lagrangian dual method [12]. This method was originally developed to cope with robustness issues of
the dual ascent method, but it turns out that different variations of this technique, such as the method of multipliers [4],
tend to converge in fewer iterations than gradient descent. Recently, a few applications of these algorithms to distributed
optimization have been proposed [4], [13], but convergence rate estimates and optimal algorithm parameters are still
unaddressed. A third way to obtain faster convergence is to use multi-step methods [9]. These methods rely only
on gradient information but use a history of the past iterates when computing the future ones. This paper explores
the latter approach for distributed optimization, and addresses the design, convergence properties, optimal step-size
selection, and robustness of networked multi-step methods. Moreover, we apply the developed techniques to three
important classes of distributed optimization problems.

This paper makes the following contributions. First, we develop a multi-step weighted gradient method that maintains
a network-wide constraint on the decision variables throughout the iterations. The accelerated algorithm is based on the
heavy-ball method by Polyak [14], extended to the networked setting. We derive optimal algorithm parameters, show that
the method has a linear convergence rate, and quantify the improvement in convergence factor over the gradient method.
Our analysis shows that the method is particularly advantageous when the eigenvalues of the Hessian of the objective
function and/or the eigenvalues of the graph Laplacian of the underlying network have a large spread. Second, we
investigate how similar techniques can be used to accelerate dual decomposition across a network of decision-makers.
In particular, given smoothness parameters of the objective function, we present closed-form expressions for the optimal
parameters of an accelerated gradient method for the dual. Third, we quantify how the convergence properties of the
algorithm are affected when the algorithm is tuned using misestimated problem parameters. This robustness analysis
shows that the accelerated algorithm tolerates parameter errors well and in most cases outperforms its non-accelerated
counterpart. Finally, we apply the developed algorithms to three case studies: networked resource allocation, consensus,
and network flow control. In each application we demonstrate superior performance compared to alternatives from the
literature.

The paper is organized as follows. In Section II, we introduce our networked optimization problem. Section III reviews
multi-step gradient techniques. Section IV proposes a multi-step weighted gradient algorithm, establishes conditions
for its convergence and derives optimal step-size parameters. Section V develops a technique for accelerating the dual
problem based on parameters for the (smooth) primal. Section VI presents a robustness analysis of the multi-step
algorithm in the presence of uncertainty. Section VII applies the proposed techniques to three engineering problems:
resource allocation, consensus and network flow control; numerical results and performance comparisons are presented
for each case study. Finally, concluding remarks are given in Section VIII.

II. Assumptions and problem formulation

This paper is concerned with collaborative optimization by a network of decision-makers. Each decision-maker $v$ is
endowed with a loss function $f_v : \mathbb{R} \to \mathbb{R}$, has control of one decision variable $x_v \in \mathbb{R}$, and collaborates with the
others to solve the optimization problem

$$\begin{array}{ll} \text{minimize} & \sum_{v \in \mathcal{V}} f_v(x_v) \\ \text{subject to} & Ax = b \end{array} \tag{1}$$

for a given matrix $A \in \mathbb{R}^{m \times n}$ and vector $b \in \mathbb{R}^m$. We will assume that $b$ lies in the range space of $A$, i.e. that there exists
at least one decision vector $x$ that satisfies the constraints.

The information exchange between decision-makers is represented by a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with vertex set
$\mathcal{V} = \{1, 2, \dots, n\}$ and edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. Specifically, at each time $t$, we will assume that decision-maker $v$ has access
to $\nabla f_w(x_w(t))$ for all its neighbors $w \in \mathcal{N}_v = \{w \mid (v, w) \in \mathcal{E}\}$.

Most acceleration techniques in the literature (e.g., [15], [16], [17]) require that the loss functions are smooth and
convex. Similarly, we will make the following assumptions:

Assumption 1: Each loss function $f_v$ is convex and twice continuously differentiable with

$$l_v \le \nabla^2 f_v(x_v) \le u_v, \quad \forall x_v, \tag{2}$$

for some positive real constants $l_v$, $u_v$ with $0 < l_v \le u_v$.

Some remarks are in order. Let $l = \min_{v \in \mathcal{V}} l_v$, $u = \max_{v \in \mathcal{V}} u_v$ and define $f(x) := \sum_{v \in \mathcal{V}} f_v(x_v)$. Then, Assumption 1
ensures that $f(x)$ is strongly convex with modulus $l$:

$$f(y) \ge f(x) + (y - x)^T \nabla f(x) + \frac{l}{2}\|y - x\|^2 \quad \forall (x, y)$$

and that its gradient is Lipschitz-continuous with constant $u$:

$$f(y) \le f(x) + (y - x)^T \nabla f(x) + \frac{u}{2}\|y - x\|^2 \quad \forall (x, y).$$
See, e.g, lfT31 Lemma 1.2.2 and Theorem 2.1.11] for details. Similarly, the Hessian of / satisfies 

$$lI \preceq \nabla^2 f(x) \preceq uI \quad \forall x. \tag{3}$$

Furthermore, Assumption 1 guarantees that (1) is a convex optimization problem whose unique optimizer $x^\star$ satisfies

$$Ax^\star = b, \quad \nabla f(x^\star) = A^T \mu^\star \tag{4}$$

where $\mu^\star \in \mathbb{R}^m$ is the (unique) optimal Lagrange multiplier for the linear constraints.

III. Background on multi-step methods 
The basic gradient method for unconstrained minimization of a convex function $f(x)$ takes the form

$$x(k+1) = x(k) - \alpha \nabla f(x(k)), \tag{5}$$

where $\alpha > 0$ is a fixed step-size parameter. Assume that $f(x)$ is strongly convex with modulus $l$ and has
Lipschitz-continuous gradient with constant $u$. Then, if $\alpha < 2/u$, the sequence $\{x(k)\}$ generated by (5) converges to $x^\star$ at linear
rate, i.e. there exists a convergence factor $q \in (0, 1)$ such that

$$\|x(k+1) - x^\star\| \le q\,\|x(k) - x^\star\|$$

for all $k$. The smallest convergence factor is $q = (u - l)/(u + l)$, obtained for the step-size $\alpha = 2/(l + u)$ (see, e.g., [14]).
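The convergence factor above can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical two-dimensional quadratic $f(x) = \tfrac{1}{2}(l x_1^2 + u x_2^2)$ with minimizer $x^\star = 0$; the function name `gradient_descent` and the values of `l` and `u` are illustrative choices, not taken from the paper.

```python
import math

def gradient_descent(grad, x0, alpha, iters):
    """Run x(k+1) = x(k) - alpha * grad(x(k)) and return the list of iterates."""
    xs = [list(x0)]
    for _ in range(iters):
        x = xs[-1]
        g = grad(x)
        xs.append([xi - alpha * gi for xi, gi in zip(x, g)])
    return xs

# Hypothetical test problem: f(x) = 0.5 * (l * x1^2 + u * x2^2), minimizer x* = 0.
l, u = 1.0, 25.0
grad = lambda x: [l * x[0], u * x[1]]

alpha = 2.0 / (l + u)      # step-size that minimizes the convergence factor
q = (u - l) / (u + l)      # predicted convergence factor

xs = gradient_descent(grad, [1.0, 1.0], alpha, 50)
norms = [math.hypot(*x) for x in xs]
ratios = [norms[k + 1] / norms[k] for k in range(50)]
worst_ratio = max(ratios)  # largest observed one-step contraction ||x(k+1)|| / ||x(k)||
```

For this particular quadratic both coordinates contract by exactly $|1 - \alpha l| = |1 - \alpha u| = q$ per step, so the observed ratio should match the predicted factor up to rounding.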

While the convergence rate cannot be improved unless higher-order information is considered [14], the convergence
factor $q$ can be improved by accounting for the history of iterates when computing the ones to come. Methods in
which the next iterate depends not only on the current iterate but also on the preceding ones are called multi-step
methods. The simplest multi-step extension of the gradient method is

$$x(k+1) = x(k) - \alpha \nabla f(x(k)) + \beta\,(x(k) - x(k-1)) \tag{6}$$

for fixed step-size parameters $\alpha > 0$ and $\beta > 0$. This technique, originally proposed by Polyak, is sometimes called the
heavy-ball method based on the physical interpretation of the added "momentum term". For a centralized set-up, Polyak
derived the optimal step-size parameters and showed that these guarantee a convergence factor of
$(\sqrt{u} - \sqrt{l})/(\sqrt{u} + \sqrt{l})$, which is always smaller than the convergence factor for the gradient method and significantly so when
$\sqrt{u}/\sqrt{l}$ is large.
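To illustrate the difference between (5) and (6), the sketch below runs both on the same ill-conditioned quadratic. It assumes the classical heavy-ball tuning $\alpha = 4/(\sqrt{u} + \sqrt{l})^2$, $\beta = \big((\sqrt{u} - \sqrt{l})/(\sqrt{u} + \sqrt{l})\big)^2$ and the initialization $x(-1) = x(0)$; the test function and all names are hypothetical.

```python
import math

def heavy_ball(grad, x0, alpha, beta, iters):
    """Iterate x(k+1) = x(k) - alpha*grad(x(k)) + beta*(x(k) - x(k-1)), with x(-1) = x(0).

    Setting beta = 0 recovers the plain gradient method.
    """
    prev, cur = list(x0), list(x0)
    for _ in range(iters):
        g = grad(cur)
        nxt = [c - alpha * gi + beta * (c - p) for c, gi, p in zip(cur, g, prev)]
        prev, cur = cur, nxt
    return cur

# Hypothetical test problem: f(x) = 0.5 * (l * x1^2 + u * x2^2), minimizer x* = 0.
l, u = 1.0, 100.0
grad = lambda x: [l * x[0], u * x[1]]
x0 = [1.0, 1.0]

# Gradient method with its best fixed step-size.
x_gd = heavy_ball(grad, x0, 2.0 / (l + u), 0.0, 200)

# Heavy-ball method with the classical optimal parameters (assumed tuning).
s = math.sqrt(u) + math.sqrt(l)
alpha_hb = 4.0 / s ** 2
beta_hb = ((math.sqrt(u) - math.sqrt(l)) / s) ** 2
x_hb = heavy_ball(grad, x0, alpha_hb, beta_hb, 200)

err_gd = math.hypot(*x_gd)   # distance to x* = 0 after 200 gradient steps
err_hb = math.hypot(*x_hb)   # distance to x* = 0 after 200 heavy-ball steps
```

With $u/l = 100$ the predicted factors are $99/101 \approx 0.98$ for the gradient method and $9/11 \approx 0.82$ for the heavy-ball method, so after the same number of iterations the heavy-ball error should be smaller by many orders of magnitude.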

In the following sections, we will develop multi-step gradient methods for network-constrained optimization, analyze 
their convergence properties, and develop techniques for finding the optimal algorithm parameters. 

IV. A MULTI-STEP WEIGHTED GRADIENT METHOD 

In the absence of constraints, (1) is trivial to solve since the objective function is separable and each decision-maker
could simply minimize its loss independently of the others. Hence, it is the existence of constraints that makes (1)
challenging. In the optimization literature, there are essentially two ways of dealing with constraints. One way is to
project the iterates onto the constraint set to maintain feasibility at all times; such a method will be developed in this
section. The other way is to use dual decomposition to eliminate couplings between decision-makers and solve the
associated dual problem; we will return to such techniques in Section V.

Computing the Euclidean projection onto the constraint set of (1) typically requires the full decision vector $x$, which
is not available to the decision-makers in our setting. An alternative, explored e.g. in [18], is to consider weighted
gradient methods which use a linear combination of the information available to nodes to ensure that iterates remain
feasible. For our problem (1), the weighted gradient method takes the form

$$x(k+1) = x(k) - \alpha W \nabla f(x(k)) \tag{7}$$

where $W \in \mathbb{R}^{n \times n}$ is a weight matrix that satisfies the sparsity constraint $W_{vw} = 0$ if $v \ne w$ and $(v, w) \notin \mathcal{E}$. In
this way, the iterations (7) read

$$x_v(k+1) = x_v(k) - \alpha \sum_{w \in \{v\} \cup \mathcal{N}_v} W_{vw} \nabla f_w(x_w(k))$$

and can be executed by individual decision-makers based on the information that they have access to. If $W$ satisfies

$$AW = 0, \quad WA^T = 0 \tag{8}$$

then $Ax(k+1) = Ax(k)$ for all $k$, so the iterates remain feasible whenever the initial point $x(0)$ is feasible. While the
constraints on $W$ might appear restrictive, it is possible to construct appropriate weight matrices for many applications.
The following examples describe two such cases.

Example 1: When the decision-makers are only constrained by a total resource budget, (1) reduces to

$$\begin{array}{ll} \text{minimize} & \sum_{v \in \mathcal{V}} f_v(x_v) \\ \text{subject to} & \sum_{v \in \mathcal{V}} x_v = x_{\text{tot}} \end{array}$$

A distributed gradient method for this problem was developed in [19]. Later, [18] interpreted these iterations as a weighted
gradient method and developed techniques for computing the weight matrix $W$ that minimizes the guaranteed convergence factor.
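The feasibility-preserving property of the weighted gradient iteration can be sketched for the budget constraint of Example 1. The code below is an illustration under stated assumptions, not the weight design of [18]: it uses the Laplacian of a hypothetical 4-node path graph as $W$ (whose rows and columns sum to zero), quadratic losses $f_v(x_v) = \tfrac{1}{2} a_v (x_v - c_v)^2$ with made-up data, and a hand-picked step-size.

```python
def path_laplacian(n):
    """Laplacian of a path graph with n nodes (hypothetical test topology)."""
    W = [[0.0] * n for _ in range(n)]
    for v in range(n - 1):
        W[v][v] += 1.0
        W[v + 1][v + 1] += 1.0
        W[v][v + 1] -= 1.0
        W[v + 1][v] -= 1.0
    return W

def weighted_gradient_step(x, W, grad, alpha):
    """One iteration x(k+1) = x(k) - alpha * W * grad f(x(k)).

    Node v only uses gradients of nodes w with W[v][w] != 0 (itself and its neighbors).
    """
    g = grad(x)
    n = len(x)
    return [
        x[v] - alpha * sum(W[v][w] * g[w] for w in range(n) if W[v][w] != 0.0)
        for v in range(n)
    ]

# Illustrative data (not from the paper): f_v(x_v) = 0.5 * a_v * (x_v - c_v)^2.
a = [1.0, 2.0, 1.5, 1.0]
c = [3.0, -1.0, 0.5, 2.0]
grad = lambda x: [a[v] * (x[v] - c[v]) for v in range(len(x))]

W = path_laplacian(4)        # 1^T W = 0 and W 1 = 0, so the budget is preserved
x_tot = 4.0
x = [x_tot, 0.0, 0.0, 0.0]   # feasible start: entries sum to x_tot
for _ in range(2000):
    x = weighted_gradient_step(x, W, grad, alpha=0.1)

marginal_costs = grad(x)     # at the optimum these are equal across nodes
```

At the optimum the marginal costs $f_v'(x_v)$ coincide, which matches the optimality condition $\nabla f(x^\star) = A^T \mu^\star$ with $A = \mathbf{1}^T$.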

Example 2: Consider a scenario where the decision-makers have to find a common decision $x$ that minimizes the
total cost

$$\text{minimize} \quad \sum_{v \in \mathcal{V}} f_v(x).$$

We can rewrite this problem in our standard form (1) by introducing local decision variables $x_v$:

$$\begin{array}{ll} \text{minimize} & \sum_{v \in \mathcal{V}} f_v(x_v) \\ \text{subject to} & x_v = x_w \quad \forall (v, w) \in \mathcal{E} \end{array} \tag{9}$$

Note that in vector form, the constraint of (9) reads $Ax = 0$, where $A \in \mathbb{R}^{|\mathcal{E}| \times |\mathcal{V}|}$ is the incidence matrix of the graph
$\mathcal{G}$. Next, we will show that the gradient iterations for the dual problem of (9) have the structure of a weighted gradient
method in the primal variables. To this end, we form the Lagrangian $L(x, \mu) = f(x) - \mu^T A x$ and the dual function

$$d(\mu) = \inf_x L(x, \mu) = \inf_x \{ f(x) - \mu^T A x \}.$$

Under Assumption 1, the Lagrangian has a unique minimizer $x^\star(\mu) = (\nabla f)^{-1}(A^T \mu)$ and the dual function is
continuously differentiable with $\nabla d(\mu) = -A x^\star(\mu)$. Hence, the iterations

$$\begin{aligned} \mu(k+1) &= \mu(k) - \alpha A x(k) \\ x(k+1) &= (\nabla f)^{-1}(A^T \mu(k+1)) \end{aligned}$$

will converge to a primal-dual optimal pair for an appropriately chosen step-size $\alpha$. Introducing $z(k) := A^T \mu(k)$ and
multiplying both sides of the first iteration by $A^T$, we obtain

$$\begin{aligned} z(k+1) &= z(k) - \alpha W (\nabla f)^{-1}(z(k)) \\ x(k+1) &= (\nabla f)^{-1}(z(k+1)) \end{aligned} \tag{10}$$

Note that $W = A^T A$ is the graph Laplacian of $\mathcal{G}$ and that $W$ satisfies the sparsity constraint for distributed execution
detailed above. One can readily verify that $W$ has a simple eigenvalue at $0$ for which $W\mathbf{1} = 0$.

One important application of this technique is distributed averaging, in which nodes should converge to the
network-wide average of constants $c_v$ held by each node $v \in \mathcal{V}$. This average can be found by solving (9) with
$f_v(x_v) = (x_v - c_v)^2/2$ (since its optimal solution is the average of the constants $c_v$). Here $\nabla f(x) = x - c$, so that
$(\nabla f)^{-1}(z) = z + c$ and the corresponding iterations (10) read

$$\begin{aligned} z(k+1) &= z(k) - \alpha W (z(k) + c) \\ x(k+1) &= z(k+1) + c \end{aligned}$$

We will return to these iterations and their accelerated counterparts in Section VII.
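To see why these iterations perform averaging, the two update lines can be combined into a single recursion in $x$. The short derivation below is a sketch that assumes the initialization $\mu(0) = 0$ (so that $z(0) = 0$ and $x(0) = c$), which the text does not state explicitly, and uses $\mathbf{1}^T W = 0$ for the graph Laplacian:

```latex
\begin{align*}
\nabla f(x) &= x - c
  \quad\Longrightarrow\quad (\nabla f)^{-1}(z) = z + c, \\
x(k+1) &= z(k+1) + c
        = z(k) + c - \alpha W\bigl(z(k) + c\bigr)
        = (I - \alpha W)\,x(k), \\
\mathbf{1}^T x(k+1) &= \mathbf{1}^T x(k) - \alpha\,(\mathbf{1}^T W)\,x(k)
        = \mathbf{1}^T x(k) = \dots = \mathbf{1}^T c .
\end{align*}
```

The sum of the node values is therefore invariant, and any limit point with all entries equal must be the average of the constants $c_v$.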



A. A multi-step weighted gradient method and its convergence 

The examples indicate that variants of the weighted gradient method with improved convergence factors could also
speed up the convergence of network-wide resource allocation and consensus processes. To this end, we
consider the following multi-step variant of the weighted gradient iteration:

$$x(k+1) = x(k) - \alpha W \nabla f(x(k)) + \beta\,(x(k) - x(k-1)) \tag{11}$$

Under the sparsity constraint on $W$ detailed above, these iterations can be implemented by individual decision-makers.
Moreover, (8) ensures that if $x(1)$ and $x(0)$ satisfy the constraints of (1), then every iterate produced by (11) will
also be feasible. The next theorem characterizes the convergence of the iterations (11) and derives optimal step-size
parameters.

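Theorem 1: Consider the optimization problem (1) under Assumption 1 and let $x^\star$ denote its unique optimizer.
Assume that $W$ has $m < n$ eigenvalues at $0$ and satisfies $AW = 0$ and $WA^T = 0$. Let $H = \nabla^2 f(x^\star)$ and let
$0 = \lambda_1(WH) = \dots = \lambda_m(WH) < \lambda_{m+1}(WH) = \underline{\lambda} \le \dots \le \lambda_n(WH) = \overline{\lambda}$ be the magnitudes of the eigenvalues of $WH$.
Then, for

$$0 \le \beta < 1, \qquad 0 < \alpha < \frac{2(1 + \beta)}{\overline{\lambda}},$$

the iterates (11) converge to $x^\star$ at linear rate

$$\|x(k+1) - x^\star\| \le q\,\|x(k) - x^\star\| \quad \forall k \ge 0$$

with $q = \max\{\sqrt{\beta},\ |1 + \beta - \alpha\underline{\lambda}| - \sqrt{\beta},\ |1 + \beta - \alpha\overline{\lambda}| - \sqrt{\beta}\}$. Moreover, the minimal value of $q$ is

$$q^\star = \frac{\sqrt{\overline{\lambda}} - \sqrt{\underline{\lambda}}}{\sqrt{\overline{\lambda}} + \sqrt{\underline{\lambda}}}$$

obtained for step-sizes $\alpha = \alpha^\star$ and $\beta = \beta^\star$, where

$$\alpha^\star = \left(\frac{2}{\sqrt{\overline{\lambda}} + \sqrt{\underline{\lambda}}}\right)^2, \qquad \beta^\star = \left(\frac{\sqrt{\overline{\lambda}} - \sqrt{\underline{\lambda}}}{\sqrt{\overline{\lambda}} + \sqrt{\underline{\lambda}}}\right)^2.$$

Proof: See the appendix for all the proofs.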
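As an illustration of the multi-step tuning, the sketch below applies it to distributed averaging on a hypothetical path graph with $n = 20$ nodes. It assumes $H = I$ (quadratic losses $f_v(x_v) = (x_v - c_v)^2/2$), uses the averaging recursion $x(k+1) = x(k) - \alpha W x(k)$ with an added momentum term $\beta(x(k) - x(k-1))$, and relies on the known closed form $2 - 2\cos(\pi k/n)$, $k = 0, \dots, n-1$, for the Laplacian eigenvalues of a path graph. All names and data are illustrative.

```python
import math

def laplacian_apply(x):
    """Return W x for the path-graph Laplacian W; each node uses only its neighbors."""
    n = len(x)
    y = []
    for v in range(n):
        acc = 0.0
        if v > 0:
            acc += x[v] - x[v - 1]
        if v < n - 1:
            acc += x[v] - x[v + 1]
        y.append(acc)
    return y

def run(c, alpha, beta, iters):
    """Iterate x(k+1) = x(k) - alpha*W*x(k) + beta*(x(k) - x(k-1)), x(-1) = x(0) = c."""
    prev, cur = list(c), list(c)
    for _ in range(iters):
        wx = laplacian_apply(cur)
        nxt = [cur[v] - alpha * wx[v] + beta * (cur[v] - prev[v]) for v in range(len(c))]
        prev, cur = cur, nxt
    return cur

n = 20
c = [float(v) for v in range(n)]          # illustrative initial values
avg = sum(c) / n

# Extreme non-zero Laplacian eigenvalues of the path graph (H = I, so WH = W).
lam_lo = 2.0 - 2.0 * math.cos(math.pi / n)
lam_hi = 2.0 - 2.0 * math.cos(math.pi * (n - 1) / n)

# Single-step tuning versus multi-step tuning.
s = math.sqrt(lam_hi) + math.sqrt(lam_lo)
alpha_star = 4.0 / s ** 2
beta_star = ((math.sqrt(lam_hi) - math.sqrt(lam_lo)) / s) ** 2

x_grad = run(c, 2.0 / (lam_lo + lam_hi), 0.0, 300)
x_ms = run(c, alpha_star, beta_star, 300)

def dist_to_average(x):
    return math.sqrt(sum((xi - avg) ** 2 for xi in x))

err_grad = dist_to_average(x_grad)
err_ms = dist_to_average(x_ms)
```

Here $\overline{\lambda}/\underline{\lambda} \approx 160$, so the predicted factors are roughly $0.99$ for the single-step iteration and $0.85$ for the multi-step one; after 300 iterations the multi-step iterate should be at the average up to rounding, while the single-step iterate is still visibly off.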



Similar to the discussion in Section III, it is interesting to investigate when (11) significantly improves over the
single-step algorithm. In [18], it is shown that the best convergence factor of the weighted gradient iteration (7) is

$$q_G^\star = \frac{\overline{\lambda} - \underline{\lambda}}{\overline{\lambda} + \underline{\lambda}}.$$

One can verify that $q^\star \le q_G^\star$, i.e. the multi-step iterations can always be tuned to converge faster. Moreover, the
improvement in convergence factor depends on the quantity $\kappa = \overline{\lambda}/\underline{\lambda}$: when $\kappa$ is large, the speed-up is roughly
proportional to $\sqrt{\kappa}$. In the networked setting, there are two reasons for a large value of $\kappa$. One is simply that the
Hessian of the objective function is ill-conditioned, so that the ratio $u/l$ is large. The other is that the matrix $W$ is
ill-conditioned, i.e. that $\lambda_n(W)/\lambda_{m+1}(W)$ is large. As we have seen in the examples, the graph Laplacian is often a valid
choice for $W$. Thus, the topology of the underlying graph directly impacts the convergence rate (and the convergence
rate improvements) of the multi-step weighted gradient method. We will return to this in detail in Section VII.
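The claim that the speed-up grows like $\sqrt{\kappa}$ can be made concrete by counting the iterations each factor needs to reduce the error by a fixed amount. The sketch below is illustrative only; the tolerance `eps` and the helper names are assumptions, and the iteration count is the idealized value obtained from $q^k \le \varepsilon$.

```python
import math

def factor_gradient(kappa):
    """Best factor of the single-step method: (kappa - 1) / (kappa + 1)."""
    return (kappa - 1.0) / (kappa + 1.0)

def factor_multistep(kappa):
    """Best factor of the multi-step method: (sqrt(kappa) - 1) / (sqrt(kappa) + 1)."""
    r = math.sqrt(kappa)
    return (r - 1.0) / (r + 1.0)

def iterations_needed(q, eps=1e-6):
    """Smallest integer k with q**k <= eps (idealized count, ignoring constants)."""
    return math.ceil(math.log(eps) / math.log(q))

speedups = {}
for kappa in (10.0, 100.0, 1e4):
    k_grad = iterations_needed(factor_gradient(kappa))
    k_ms = iterations_needed(factor_multistep(kappa))
    speedups[kappa] = k_grad / k_ms   # ratio of iteration counts
```

For large $\kappa$ the ratio of iteration counts approaches $\sqrt{\kappa}$, since $-\log q_G^\star \approx 2/\kappa$ while $-\log q^\star \approx 2/\sqrt{\kappa}$.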

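In many applications, we will not know $H = \nabla^2 f(x^\star)$, but only bounds such as (3). The next result can then be
useful.

Proposition 1: Let $\underline{\lambda}_W = l\,\lambda_{m+1}(W)$ and $\overline{\lambda}_W = u\,\lambda_n(W)$. Then $\underline{\lambda}_W \le \underline{\lambda}$ and $\overline{\lambda}_W \ge \overline{\lambda}$. Moreover, the step-sizes

$$\alpha = \left(\frac{2}{\sqrt{\overline{\lambda}_W} + \sqrt{\underline{\lambda}_W}}\right)^2, \qquad \beta = \left(\frac{\sqrt{\overline{\lambda}_W} - \sqrt{\underline{\lambda}_W}}{\sqrt{\overline{\lambda}_W} + \sqrt{\underline{\lambda}_W}}\right)^2$$

guarantee that (11) converges to $x^\star$ at linear rate

$$\|x(k+1) - x^\star\| \le q\,\|x(k) - x^\star\| \quad \forall k,$$

where

$$q = \frac{\sqrt{\overline{\lambda}_W} - \sqrt{\underline{\lambda}_W}}{\sqrt{\overline{\lambda}_W} + \sqrt{\underline{\lambda}_W}}.$$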

B. Optimal weight selection for the multi-step method

The results in the previous subsection provide optimal step-size parameters a and /3 for a given weight matrix W . 
However, the expressions for the associated convergence factors depend on the eigenvalues of WH and optimizing the 
entries in W jointly with the step-size parameters can yield even further speed-ups. We make the following observation. 

Proposition 2: Under the hypotheses of Proposition 1,

(i) If $H$ is known, then minimizing the convergence factor $q^\star$ is equivalent to minimizing $\overline{\lambda}/\underline{\lambda}$.

(ii) If $H$ is not known, while $l$ and $u$ in (3) are, then the weight matrix that minimizes $q$ is the one with minimal
value of $\lambda_n(W)/\lambda_{m+1}(W)$.

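The next result shows how the optimal weight selection for both scenarios can be found via convex optimization.
Proposition 3: Let $\mathcal{M}$ be the span of real symmetric matrices with the sparsity pattern induced by $\mathcal{G}$, i.e.

$$\mathcal{M} = \{M \in \mathcal{S}^n \mid M_{vw} = 0 \text{ if } v \ne w \text{ and } (v, w) \notin \mathcal{E}\}.$$

Then the problem of minimizing $\overline{\lambda}/\underline{\lambda}$ is equivalent to

$$\begin{array}{ll} \text{minimize} & t \\ \text{subject to} & I_{n-m} \preceq P^T H^{1/2} W H^{1/2} P \preceq t I_{n-m} \\ & H^{1/2} W H^{1/2} \in \mathcal{M}, \quad H^{1/2} W H^{1/2} V = 0, \end{array} \tag{12}$$

where $V = [v_1, \dots, v_m] \in \mathbb{R}^{n \times m}$ is the eigenvector space corresponding to the zero eigenvalues of $W H^{1/2}$ and
$P = [p_1, p_2, \dots, p_{n-m}] \in \mathbb{R}^{n \times (n-m)}$ is a matrix of unit vectors spanning $V^\perp$.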
Note that when we only want to minimize the condition number of $W$ subject to the structural constraints, we simply
set $H = I$ in the formulation above.

Remark 1: The lower bound in (12) is rather arbitrary: any scaled matrix $\gamma W$ for $\gamma \in \mathbb{R}_+$ has the same condition
number as $W$, and if $\alpha^\star$ and $\beta^\star$ are the optimal step-sizes for the matrix $W$, then $\alpha = \alpha^\star/\gamma$ and $\beta = \beta^\star$ are optimal
for $\gamma W$.
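The scaling invariance in the remark follows directly from the step-size formulas of Theorem 1, because the non-zero eigenvalues of $\gamma W H$ are $\gamma$ times those of $WH$. A short worked derivation, writing $\alpha^\star(\cdot)$ and $\beta^\star(\cdot)$ for the optimal parameters as functions of the weight matrix (a notation introduced here only for illustration):

```latex
\begin{align*}
\alpha^\star(\gamma W)
  &= \left(\frac{2}{\sqrt{\gamma\overline{\lambda}} + \sqrt{\gamma\underline{\lambda}}}\right)^{2}
   = \frac{1}{\gamma}\left(\frac{2}{\sqrt{\overline{\lambda}} + \sqrt{\underline{\lambda}}}\right)^{2}
   = \frac{\alpha^\star(W)}{\gamma}, \\
\beta^\star(\gamma W)
  &= \left(\frac{\sqrt{\gamma\overline{\lambda}} - \sqrt{\gamma\underline{\lambda}}}
               {\sqrt{\gamma\overline{\lambda}} + \sqrt{\gamma\underline{\lambda}}}\right)^{2}
   = \beta^\star(W), \\
\alpha^\star(\gamma W)\,\gamma W &= \alpha^\star(W)\,W .
\end{align*}
```

The last line shows that the iteration (11) itself is unchanged by the scaling: only the product $\alpha W$ enters the update.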

V. A MULTI-STEP DUAL ASCENT METHOD 

An alternative approach for solving (1) is to use Lagrange relaxation, i.e. to introduce Lagrange multipliers $\mu \in \mathbb{R}^m$
for the equality constraints and solve the dual problem. The dual function associated with (1) is then

$$d(\mu) = \inf_x \{ f(x) + \mu^T (Ax - b) \} = -f^*(-A^T \mu) - \mu^T b \tag{13}$$

where $f^*(y) = \sup_x \{ y^T x - f(x) \}$ is the conjugate function of $f$. The dual problem is to maximize the dual function
with respect to $\mu$, i.e.,

$$\text{minimize} \quad -d(\mu) = f^*(-A^T \mu) + b^T \mu.$$

Recall that if $f$ is strongly convex, then $f^*$ and hence $-d$ are convex and continuously differentiable [20]. Hence, in
light of our earlier discussion, it is natural to attempt to solve the dual problem using the multi-step iteration

$$\mu(k+1) = \mu(k) + \alpha \nabla d(\mu(k)) + \beta\,(\mu(k) - \mu(k-1)). \tag{14}$$

In order to find the optimal step-sizes and estimate the convergence factors of the iterations, we need to be able to
bound the convexity modulus of $-d$ and the Lipschitz constant of its gradient. The following observation is
a first step towards this goal:

Lemma 1: Consider the optimization problem (1) with associated dual function (13). Let $f$ be a continuously
differentiable and closed convex function. Then,

(i) If $f$ is strongly convex with modulus $l$, then $-\nabla d$ is Lipschitz continuous with constant $\lambda_n(AA^T)/l$.

(ii) If $\nabla f$ is Lipschitz continuous with constant $u$, then $-d$ is strongly convex with modulus $\lambda_1(AA^T)/u$.

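These dual bounds can be used to find step-sizes with strong performance guarantees for the dual iterations. Specifically:

Theorem 2: Consider the smoothness bounds stated in Lemma 1. Then, the accelerated dual iterations (14) converge
to $\mu^\star$ at linear rate with the guaranteed convergence factor

$$q = \frac{\sqrt{u\lambda_n(AA^T)} - \sqrt{l\lambda_1(AA^T)}}{\sqrt{u\lambda_n(AA^T)} + \sqrt{l\lambda_1(AA^T)}},$$

obtained for step-sizes:

$$\alpha = \left(\frac{2\sqrt{ul}}{\sqrt{u\lambda_n(AA^T)} + \sqrt{l\lambda_1(AA^T)}}\right)^2, \qquad \beta = \left(\frac{\sqrt{u\lambda_n(AA^T)} - \sqrt{l\lambda_1(AA^T)}}{\sqrt{u\lambda_n(AA^T)} + \sqrt{l\lambda_1(AA^T)}}\right)^2.$$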
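One way to see where these expressions come from is to insert the bounds of Lemma 1 into the heavy-ball formulas of Section III. The derivation below is a sketch; the shorthand $l_d$ and $u_d$ for the convexity modulus and gradient Lipschitz constant of $-d$ is introduced here only for illustration.

```latex
\begin{align*}
l_d &= \frac{\lambda_1(AA^T)}{u}, \qquad u_d = \frac{\lambda_n(AA^T)}{l}, \\
q &= \frac{\sqrt{u_d} - \sqrt{l_d}}{\sqrt{u_d} + \sqrt{l_d}}
   = \frac{\sqrt{u\lambda_n(AA^T)} - \sqrt{l\lambda_1(AA^T)}}
          {\sqrt{u\lambda_n(AA^T)} + \sqrt{l\lambda_1(AA^T)}}
   \quad\text{(multiply numerator and denominator by } \sqrt{ul}\text{)}, \\
\alpha &= \left(\frac{2}{\sqrt{u_d} + \sqrt{l_d}}\right)^{2}
        = \left(\frac{2\sqrt{ul}}{\sqrt{u\lambda_n(AA^T)} + \sqrt{l\lambda_1(AA^T)}}\right)^{2},
\qquad \beta = q^{2}.
\end{align*}
```

The guaranteed factor thus depends on the product of the condition numbers $u/l$ and $\lambda_n(AA^T)/\lambda_1(AA^T)$ through its square root.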

The advantage of Theorem 2 is that it provides step-size parameters with a guaranteed convergence factor using readily
available data of the primal problem. How close to optimal these results are depends on how tight the bounds in
Lemma 1 are. If the bounds are tight, then the step-sizes in Theorem 2 are truly optimal. The next example shows that
a certain degree of conservatism may be present, even for quadratic programming problems.
Example 3: Consider the quadratic minimization problem

$$\begin{array}{ll} \text{minimize} & \frac{1}{2} x^T Q x \\ \text{subject to} & Ax = b \end{array}$$

where $Q \in \mathcal{S}^n_{++}$, $A \in \mathbb{R}^{n \times n}$ is nonsingular, and $b \in \mathbb{R}^n$. This implies that the objective function is strongly
convex with modulus $\lambda_1(Q)$ and that its gradient is Lipschitz-continuous with constant $\lambda_n(Q)$. Hence, according to
Lemma 1, $-d$ is strongly convex with modulus $\lambda_1(AA^T)/\lambda_n(Q)$ and its gradient is Lipschitz continuous with constant
$\lambda_n(AA^T)/\lambda_1(Q)$. However, direct calculations reveal that

$$d(\mu) = -\tfrac{1}{2}\mu^T A Q^{-1} A^T \mu - \mu^T b$$

from which we see that $-d$ has convexity modulus $\lambda_1(AQ^{-1}A^T)$ and that its gradient is Lipschitz continuous with
constant $\lambda_n(AQ^{-1}A^T)$. By [21, p. 225], these bounds are tighter than those offered by Lemma 1. Specifically, for
congruent matrices $Q^{-1}$ and $AQ^{-1}A^T$ there exist nonnegative real numbers $\theta_k$ such that $\lambda_1(AA^T) \le \theta_k \le \lambda_n(AA^T)$
and $\theta_k \lambda_k(Q^{-1}) = \lambda_k(AQ^{-1}A^T)$. For $k = 1$ and $k = n$ we obtain

$$\lambda_1(AQ^{-1}A^T) \ge \frac{\lambda_1(AA^T)}{\lambda_n(Q)}, \qquad \lambda_n(AQ^{-1}A^T) \le \frac{\lambda_n(AA^T)}{\lambda_1(Q)}.$$

For some important classes of problems, the bounds are, however, tight. One such example is the average consensus
application considered in Section VII.
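The conservatism described in Example 3 can be checked on a small instance. The sketch below uses made-up $2 \times 2$ matrices $Q$ and $A$ (not from the paper) and a closed-form eigenvalue formula for symmetric $2 \times 2$ matrices, so that no external library is needed; all helper names are hypothetical.

```python
import math

def eig_sym2(m):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mean = 0.5 * (a + d)
    rad = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mean - rad, mean + rad

def matmul(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(x):
    return [[x[j][i] for j in range(2)] for i in range(2)]

# Illustrative data: Q positive definite (diagonal), A nonsingular.
Q = [[1.0, 0.0], [0.0, 4.0]]
Q_inv = [[1.0, 0.0], [0.0, 0.25]]
A = [[1.0, 1.0], [0.0, 1.0]]
At = transpose(A)

lam1_Q, lamn_Q = eig_sym2(Q)
lam1_AAt, lamn_AAt = eig_sym2(matmul(A, At))

# Exact convexity modulus and Lipschitz constant of -d: eigenvalues of A Q^{-1} A^T.
mod_true, lip_true = eig_sym2(matmul(matmul(A, Q_inv), At))

# Bounds from Lemma 1.
mod_lemma = lam1_AAt / lamn_Q
lip_lemma = lamn_AAt / lam1_Q

cond_true = lip_true / mod_true      # condition number seen by the dual iterations
cond_lemma = lip_lemma / mod_lemma   # condition number implied by the bounds
```

For this instance the bound-based condition number is about four times the true one, so step-sizes tuned from Lemma 1 remain valid but are more cautious than necessary.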



VI. Robustness analysis 

The proposed multi-step methods have significantly improved convergence factors compared to the gradient iterations, 
and particularly so when the Hessian of the loss function and/or the graph Laplacian of the network is ill-conditioned. 
However, to design the optimal step-sizes for the multi-step methods one needs to know the upper and lower bounds 
on the Hessian and the largest and smallest non-zero eigenvalue of the graph Laplacian. These quantities can be hard 
to estimate accurately in practice. It is therefore important to analyze the sensitivity of the multi-step methods to errors 
in these parameters to assess if the performance benefits prevail when the step-sizes are tuned based on misestimated 
parameters. Such a robustness analysis will be performed next. 

Let $\tilde{\underline{\lambda}}$ and $\tilde{\overline{\lambda}}$ denote the estimates of $\underline{\lambda}$ and $\overline{\lambda}$ available when tuning the step-sizes. We are interested in quantifying how
the convergence properties, and the convergence factors, of the gradient and the multi-step methods are affected when
$\tilde{\underline{\lambda}}$ and $\tilde{\overline{\lambda}}$ are used in the step-size formulas that we have derived earlier. Theorem 1 provides some useful observations
for the multi-step method. The corresponding results for the weighted gradient method are summarized in the following
lemma:

Lemma 2: Consider the weighted gradient iterations (7) and let $\overline{\lambda}$ and $\underline{\lambda}$ denote the largest and smallest non-zero
eigenvalue of $WH$, respectively. Then, for fixed step-size $0 < \alpha < 2/\overline{\lambda}$, (7) converges to $x^\star$ at linear rate with
convergence factor

$$q_G = \max\{\,|1 - \alpha\underline{\lambda}|,\ |1 - \alpha\overline{\lambda}|\,\}.$$

The minimal value $q_G^\star = (\overline{\lambda} - \underline{\lambda})/(\overline{\lambda} + \underline{\lambda})$ is obtained for the step-size $\alpha = 2/(\underline{\lambda} + \overline{\lambda})$.

Fig. 1. (a) Stability regions: perturbations in the white and gray areas correspond to the stable and unstable regions of
the multi-step algorithm, respectively. (b) Different perturbation regions: the multi-step algorithm outperforms the
gradient iterations for $(\underline{\varepsilon}, \overline{\varepsilon}) \in \mathcal{C} \setminus Q_4$. For symmetric errors in $Q_4$ (along the line $\overline{\varepsilon} = -\underline{\varepsilon}$), the gradient method might
outperform the multi-step algorithm. This condition is depicted in the plot as a solid line.

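Combining this lemma with our previous results from Theorem 1 yields the following observation.

Proposition 4: Let $\tilde{\underline{\lambda}}$ and $\tilde{\overline{\lambda}}$ be estimates of $\underline{\lambda}$ and $\overline{\lambda}$, respectively, and assume that $0 < \tilde{\underline{\lambda}} < \tilde{\overline{\lambda}}$. Then, for all values
of $\tilde{\underline{\lambda}}$ and $\tilde{\overline{\lambda}}$ such that $\overline{\lambda} < \tilde{\underline{\lambda}} + \tilde{\overline{\lambda}}$, both the weighted gradient iteration (7) with step-size

$$\tilde{\alpha} = \frac{2}{\tilde{\underline{\lambda}} + \tilde{\overline{\lambda}}} \tag{15}$$

and the multi-step variant (11) with

$$\tilde{\alpha} = \left(\frac{2}{\sqrt{\tilde{\overline{\lambda}}} + \sqrt{\tilde{\underline{\lambda}}}}\right)^2, \qquad \tilde{\beta} = \left(\frac{\sqrt{\tilde{\overline{\lambda}}} - \sqrt{\tilde{\underline{\lambda}}}}{\sqrt{\tilde{\overline{\lambda}}} + \sqrt{\tilde{\underline{\lambda}}}}\right)^2 \tag{16}$$

converge to the optimizer $x^\star$ of (1).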

In practice, one should expect that $\overline{\lambda}$ is overestimated, in which case both methods converge. However, convergence
can be guaranteed for a much wider range of perturbations. Figure 1(a) considers perturbations of the form $\tilde{\underline{\lambda}} = \underline{\lambda} + \underline{\varepsilon}$
and $\tilde{\overline{\lambda}} = \overline{\lambda} + \overline{\varepsilon}$. The white area is the locus of perturbations for which convergence is guaranteed, while the dark area
represents inadmissible perturbations which render either $\tilde{\underline{\lambda}}$ or $\tilde{\overline{\lambda}}$ negative. Note that both algorithms are robust to a
continuous departure from the true values of $\underline{\lambda}$ and $\overline{\lambda}$, since there is a ball with radius $\sqrt{2}\,\underline{\lambda}/2$ around the true values
for which the methods are guaranteed to converge.

Next, we proceed to compare the convergence factors of the two methods when the step-sizes are tuned based on
inaccurate parameters. The following lemma is then useful.

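Lemma 3: Let $\tilde{\underline{\lambda}}$ and $\tilde{\overline{\lambda}}$ satisfy $0 < \overline{\lambda} < \tilde{\underline{\lambda}} + \tilde{\overline{\lambda}}$. The convergence factor of the weighted gradient method (7) with
step-size (15) is given by

$$\tilde{q}_G = \begin{cases} 2\overline{\lambda}/(\tilde{\underline{\lambda}} + \tilde{\overline{\lambda}}) - 1 & \text{if } \tilde{\underline{\lambda}} + \tilde{\overline{\lambda}} \le \underline{\lambda} + \overline{\lambda} \\ 1 - 2\underline{\lambda}/(\tilde{\underline{\lambda}} + \tilde{\overline{\lambda}}) & \text{otherwise,} \end{cases} \tag{17}$$

while the multi-step weighted gradient method (11) with step-sizes (16) has convergence factor

$$\tilde{q} = \max\left\{\sqrt{\tilde{\beta}},\ |1 + \tilde{\beta} - \tilde{\alpha}\underline{\lambda}| - \sqrt{\tilde{\beta}},\ |1 + \tilde{\beta} - \tilde{\alpha}\overline{\lambda}| - \sqrt{\tilde{\beta}}\right\}. \tag{18}$$

Fig. 2. (a) Symmetric perturbations in $Q_4$: convergence factor of the multi-step and gradient algorithms under the condition
described by (19). Solid lines belong to $\tilde{q}$ while the dashed lines depict $\tilde{q}_G$. (b) General perturbation in $Q_4$: level curves
of $\tilde{q} - \tilde{q}_G$ around the origin for $(\underline{\varepsilon}, \overline{\varepsilon}) \in Q_4$.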



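The convergence factor expressions derived in Lemma 3 allow us to come to the following conclusions:
Proposition 5: Let $\tilde{\underline{\lambda}} = \underline{\lambda} + \underline{\varepsilon}$, $\tilde{\overline{\lambda}} = \overline{\lambda} + \overline{\varepsilon}$ and define the set of perturbations under which the methods converge

$$\mathcal{C} = \{(\underline{\varepsilon}, \overline{\varepsilon}) \mid \underline{\varepsilon} > -\underline{\lambda},\ \overline{\varepsilon} > -\overline{\lambda},\ \underline{\varepsilon} + \overline{\varepsilon} > -\underline{\lambda}\}$$

and the fourth quadrant in the perturbation space $Q_4 = \{(\underline{\varepsilon}, \overline{\varepsilon}) \mid \underline{\varepsilon} < 0,\ \overline{\varepsilon} > 0\}$. Then, for all $(\underline{\varepsilon}, \overline{\varepsilon}) \in \mathcal{C} \setminus Q_4$, it holds
that $\tilde{q} \le \tilde{q}_G$. However, there exist $(\underline{\varepsilon}, \overline{\varepsilon}) \in Q_4$ for which the scaled gradient has a smaller convergence factor than the
multi-step variant. In particular, for

$$(\underline{\varepsilon}, \overline{\varepsilon}) \in Q_4 \quad \text{and} \quad \frac{\overline{\lambda} + \overline{\varepsilon}}{\underline{\lambda} + \underline{\varepsilon}} \ge \left(\frac{\overline{\lambda}}{\underline{\lambda}}\right)^2 \tag{19}$$

the multi-step iterations (11) converge slower than (7).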

Fig. 1(b) illustrates the different perturbations considered in Proposition 5. While the multi-step method has superior
convergence rate for most perturbations, the troublesome region $Q_4$ is likely to be the most common one in engineering
applications, because it represents the perturbations where the smallest eigenvalue is underestimated while the largest
eigenvalue is overestimated. To shed more light on the convergence properties in this region, we perform a numerical
study on a quadratic function with $\underline{\lambda} = 1$ and $\overline{\lambda}$ varying from 2 to 100. We first consider symmetric perturbations
$\overline{\varepsilon} = -\underline{\varepsilon}$, in which case the convergence factor of the gradient method is $\tilde{q}_G = 1 - 2/(1 + \overline{\lambda}/\underline{\lambda})$ while the convergence
factor of the multi-step method is $\tilde{q} = 1 - 2/\big(1 + \sqrt{\tilde{\overline{\lambda}}/\tilde{\underline{\lambda}}}\big)$. Fig. 2(a) shows the convergence factors as a function of the
perturbation $\overline{\varepsilon} = -\underline{\varepsilon}$. The convergence factor of the gradient iterations is insensitive to this class of perturbations, while
the performance of the multi-step iterations degrades with the size of the perturbation, and will eventually become
inferior to the gradient. To complement this analysis, we also sweep over $(\underline{\varepsilon}, \overline{\varepsilon}) \in \mathcal{C} \cap Q_4$ and compute the convergence
factors for the two methods for problems with different $\overline{\lambda}$. The plot in Fig. 2(b) indicates that when the condition
number $\overline{\lambda}/\underline{\lambda}$ increases, the area where the gradient method is superior (the area above the contour line) shrinks. It
also shows that when $\tilde{\underline{\lambda}}$ tends to zero or $\tilde{\overline{\lambda}}$ is very large, the performance of the multi-step method is severely degraded.
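The effect of symmetric misestimation can be reproduced by evaluating the convergence-factor expressions directly. The sketch below is an illustration with assumed values $\underline{\lambda} = 1$ and $\overline{\lambda} = 10$; it evaluates the expression $\max\{|1 - \alpha\underline{\lambda}|, |1 - \alpha\overline{\lambda}|\}$ for the gradient method and the three-term maximum stated in Theorem 1 for the multi-step method, with step-sizes computed from the estimates. Function names are hypothetical.

```python
import math

def q_gradient(lam_lo, lam_hi, est_lo, est_hi):
    """Gradient-method factor when alpha = 2 / (est_lo + est_hi) is tuned from estimates."""
    alpha = 2.0 / (est_lo + est_hi)
    return max(abs(1.0 - alpha * lam_lo), abs(1.0 - alpha * lam_hi))

def q_multistep(lam_lo, lam_hi, est_lo, est_hi):
    """Multi-step factor expression of Theorem 1 with (alpha, beta) tuned from estimates."""
    s = math.sqrt(est_hi) + math.sqrt(est_lo)
    alpha = 4.0 / s ** 2
    beta = ((math.sqrt(est_hi) - math.sqrt(est_lo)) / s) ** 2
    rb = math.sqrt(beta)
    return max(rb,
               abs(1.0 + beta - alpha * lam_lo) - rb,
               abs(1.0 + beta - alpha * lam_hi) - rb)

lam_lo, lam_hi = 1.0, 10.0   # true extreme eigenvalues (illustrative values)

results = {}
for eps in (0.0, 0.5, 0.95):
    # Symmetric perturbation in Q4: smallest eigenvalue underestimated,
    # largest eigenvalue overestimated by the same amount.
    est_lo, est_hi = lam_lo - eps, lam_hi + eps
    results[eps] = (q_gradient(lam_lo, lam_hi, est_lo, est_hi),
                    q_multistep(lam_lo, lam_hi, est_lo, est_hi))
```

The gradient factor stays at $9/11$ for every `eps` because the sum of the estimates is unchanged, whereas the multi-step factor grows with `eps` and overtakes the gradient factor once $(\overline{\lambda} + \varepsilon)/(\underline{\lambda} - \varepsilon) \ge (\overline{\lambda}/\underline{\lambda})^2$, which for these values happens near $\varepsilon \approx 0.89$.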

VII. Applications 

In this section, we will apply the developed techniques to three classes of engineering problems for which distributed
optimization techniques have received significant attention. These are resource allocation subject to a network-wide
resource constraint, distributed averaging consensus, and Internet congestion control. In all cases, we will demonstrate
that significant speed-ups can be achieved by direct applications of our results, even when compared to acceleration
techniques that have been tailor-made for the specific problem class.

A. Accelerated resource allocation 

Our first application is the distributed resource allocation problem under a network-wide resource constraint described
in Example 1. This problem class was introduced in [19] and revisited by [18], who developed optimal and heuristic
weights for the corresponding weighted gradient iteration (7). We hence compare the multi-step method developed
in this paper with the optimal and suboptimal tuning for the standard weighted gradient iterations proposed in [18].
Similarly to [18], we create problem instances by generating random networks and assigning loss functions of the form
$f_v(x_v) = \tfrac{1}{2} a_v (x_v - c_v)^2 + \log[1 + \exp(b_v(x_v - d_v))]$ to nodes. The parameters $a_v$, $b_v$, $c_v$ and $d_v$ are drawn uniformly from
the intervals $[0, 2]$, $[-2, 2]$, $[-10, 10]$ and $[-10, 10]$, respectively. In [18] it was shown that the second derivatives of
these functions are bounded by $l_v = a_v$ and $u_v = a_v + b_v^2/4$.
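The stated curvature bounds can be verified numerically. The sketch below assumes the loss has the form $f_v(x) = \tfrac{1}{2} a (x - c)^2 + \log[1 + \exp(b(x - d))]$, whose second derivative is $a + b^2 \sigma(s)(1 - \sigma(s))$ with $s = b(x - d)$ and $\sigma$ the logistic function; the sampling range for $x$, the seed, and all names are arbitrary illustrative choices.

```python
import math
import random

def second_derivative(x, a, b, d):
    """f''(x) for f(x) = 0.5*a*(x - c)^2 + log(1 + exp(b*(x - d))); independent of c."""
    s = 1.0 / (1.0 + math.exp(-b * (x - d)))   # logistic sigmoid of b*(x - d)
    return a + b * b * s * (1.0 - s)

rng = random.Random(0)       # fixed seed so the check is reproducible
violations = 0
for _ in range(1000):
    a = rng.uniform(0.0, 2.0)
    b = rng.uniform(-2.0, 2.0)
    d = rng.uniform(-10.0, 10.0)
    x = rng.uniform(-20.0, 20.0)
    h = second_derivative(x, a, b, d)
    l_v, u_v = a, a + b * b / 4.0              # bounds quoted in the text
    if not (l_v - 1e-12 <= h <= u_v + 1e-12):
        violations += 1
```

The upper bound is attained at $x = d$, where $\sigma = 1/2$, and the lower bound is approached as $|b(x - d)|$ grows.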
TABLE I
RESOURCE ALLOCATION: GUARANTEED CONVERGENCE FACTORS

Method        Max-degree    Metropolis    Best Constant    SDP
Xiao-Boyd     0.9420        0.9318        0.9133           0.8952
Multi-step    0.8667        0.8565        0.8667           0.7604

Fig. 4. Comparison of standard, multi-step, shift-register, and Nesterov consensus algorithms using Metropolis weights;
simulation on a dumbbell graph of 100 nodes: log scale of the objective function $\|x(k) - x^\star\|$ versus iteration number $k$.
All algorithms start from a common initial point $x(0)$.

Fig. 3 shows a representative example of a problem instance along with the convergence behavior for weighted
and multi-step weighted gradient iterations for several weight choices. The optimal weights for the weighted gradient
method can be found by solving a semi-definite program derived in [18], and by Proposition 3 for the multi-step
variant. In addition, we use the heuristic weights "best constant" and "metropolis" introduced in [18]. In all cases, we
observe significantly improved convergence factors for the multi-step method.

In addition to simulations, we compare the analytical expressions for the convergence factors of the weighted gradient
and multi-step iterations. Table I again demonstrates superior performance of the multi-step method. In addition to the
heuristic weights considered previously, we have also used the "max-degree" weight heuristic from [18]. While this
weight setting tends to be worse than "best constant" for the scaled gradient iterations, the two methods will always
result in the same convergence factors for the multi-step method. This follows from Remark 1 and the fact that both
heuristics generate weight matrices of the form $\gamma\mathcal{L}$, where $\mathcal{L}$ is the Laplacian of the underlying graph and $\gamma$ is a
positive scalar.

B. Distributed averaging and consensus 

Our second application is devoted to distributed averaging. Distributed algorithms for consensus seeking have been 
researched intensively for decades, see e.g. [6], [22], [23]. Here, each node $v$ in the network initially holds a value $c_v$
and coordinates with neighbors in the graph to find the network-wide average. Clearly, this average can be found by
applying any distributed optimization technique to the problem

$$\text{minimize} \quad \sum_{v \in \mathcal{V}} \tfrac{1}{2}(x - c_v)^2 \tag{20}$$

since the optimal solution to this problem is the network-wide average of the constants $c_v$. In particular, we will explore
how the multi-step technique described in Example 2 with our optimal parameter selection rule compares with the
state-of-the-art distributed averaging algorithms from the literature.
The basic consensus algorithms use iterations of the form

$$x_v(k+1) = Q_{vv}x_v(k) + \sum_{w \in \mathcal{N}_v} Q_{vw}x_w(k) \tag{21}$$

where $Q_{vw}$ are scalar weights, and the node states are initialized with $x_v(0) = c_v$. The paper [24] provides necessary
and sufficient conditions on the weight matrix $Q = [Q_{vw}]$ for the iterations to converge to the network-wide average
of the initial values, along with computational procedures for finding the $Q$ that minimizes the convergence factor of the
iterations.

Following the steps of Example 2, the optimization approach to consensus would suggest the iterations

$$x(k+1) = x(k) - \alpha W x(k) \tag{22}$$

with $W = A^T A$ where $A$ is the incidence matrix of $\mathcal{G}$. These iterations are of the same form as (21) but use a
particular weight matrix. The multi-step counterpart of (22) is

$$x(k+1) = ((1+\beta)I - \alpha W)\,x(k) - \beta x(k-1) \tag{23}$$

In a fair comparison between the multi-step iterations (23) and the basic consensus iterations, the weight matrices of
the two approaches should not necessarily be the same, nor necessarily equal to the graph Laplacian. Rather, the weight
matrix for the consensus iterations (21) should be optimized using the results from [24] and the weight matrix for the
multi-step iteration should be computed using Proposition 3.
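As an illustration of (22) and (23), the following minimal sketch runs both iterations on a ring of 20 nodes with W equal to the graph Laplacian. The step-sizes are assumptions of this sketch: the gradient step 2/(λ_2 + λ_n) and the heavy-ball tuning built from the square roots of λ_2(W) and λ_n(W), which we take to be what Theorem 1 gives for H = I. The initialization x(-1) = x(0) = c, the ring size, and all function names are likewise hypothetical choices.

```python
import numpy as np

def ring_laplacian(n):
    """Laplacian of a ring graph with n nodes (each node has two neighbors)."""
    eye = np.eye(n)
    return 2.0 * eye - np.roll(eye, 1, axis=0) - np.roll(eye, -1, axis=0)

def consensus_errors(W, c, alpha, beta, iters):
    """Run x(k+1) = ((1+beta)I - alpha W) x(k) - beta x(k-1); beta = 0 gives (22)."""
    target = np.full_like(c, c.mean())
    x_prev, x = c.copy(), c.copy()      # assumed initialization x(-1) = x(0) = c
    errors = []
    for _ in range(iters):
        x_prev, x = x, (1.0 + beta) * x - alpha * (W @ x) - beta * x_prev
        errors.append(np.linalg.norm(x - target))
    return np.array(errors)

n = 20
W = ring_laplacian(n)
eigs = np.sort(np.linalg.eigvalsh(W))
lam2, lamn = eigs[1], eigs[-1]          # smallest non-zero and largest eigenvalue
c = np.random.default_rng(0).uniform(-10.0, 10.0, n)

alpha_gd = 2.0 / (lam2 + lamn)
alpha_ms = (2.0 / (np.sqrt(lamn) + np.sqrt(lam2))) ** 2
beta_ms = ((np.sqrt(lamn) - np.sqrt(lam2)) / (np.sqrt(lamn) + np.sqrt(lam2))) ** 2

err_gd = consensus_errors(W, c, alpha_gd, 0.0, 200)
err_ms = consensus_errors(W, c, alpha_ms, beta_ms, 200)
print(f"gradient error after 200 steps:   {err_gd[-1]:.2e}")
print(f"multi-step error after 200 steps: {err_ms[-1]:.2e}")
```

Both runs preserve the network-wide average because the all-ones vector lies in the null space of W; only the rate at which the disagreement decays differs.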

In addition to the basic consensus iterations with optimal weights, we will also compare our multi-step iterations
with two alternative acceleration schemes from the literature. The first one comes from the literature on accelerated
consensus and uses shift registers [6], [25], [26]. Similarly to the multi-step method, these techniques use a history
of past iterates, stored in local registers, when computing the next. For the basic consensus iterations (21), the shift
register yields

$$x(k+1) = \zeta Q x(k) + (1 - \zeta)\,x(k-1) \tag{24}$$

The current approaches to consensus based on shift registers assume that $Q$ is given and design $\zeta$ to minimize the
convergence factor of the iterations. The key results can be traced back to Golub and Varga [27], who determined the
optimal $\zeta$ and the associated convergence factor to be



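$$\zeta^\star = \frac{2}{1 + \sqrt{1 - \lambda_{n-1}^2(Q)}}, \qquad q_{SR}^\star = \sqrt{\frac{1 - \sqrt{1 - \lambda_{n-1}^2(Q)}}{1 + \sqrt{1 - \lambda_{n-1}^2(Q)}}} \tag{25}$$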
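The tuning in (25) can be checked numerically. The sketch below is an illustration under stated assumptions, not the procedure used for Fig. 4: it uses Metropolis weights on a ring of 20 nodes (a hypothetical test case), assumes the second-largest eigenvalue of Q dominates all other non-unit eigenvalues in magnitude, and compares the factor from (25) with the second-largest eigenvalue modulus of the two-step iteration matrix of (24). All function names are hypothetical.

```python
import numpy as np

def metropolis_ring(n):
    """Metropolis weights on a ring: every degree is 2, so every weight is 1/3."""
    eye = np.eye(n)
    return (eye + np.roll(eye, 1, axis=0) + np.roll(eye, -1, axis=0)) / 3.0

def shift_register_tuning(Q):
    """zeta and convergence factor from (25), based on the second-largest eigenvalue of Q."""
    lam = np.sort(np.linalg.eigvalsh(Q))[-2]
    root = np.sqrt(1.0 - lam ** 2)
    zeta = 2.0 / (1.0 + root)
    factor = np.sqrt((1.0 - root) / (1.0 + root))
    return lam, zeta, factor

def shift_register_matrix(Q, zeta):
    """Matrix mapping [x(k), x(k-1)] to [x(k+1), x(k)] under iteration (24)."""
    n = Q.shape[0]
    return np.block([[zeta * Q, (1.0 - zeta) * np.eye(n)],
                     [np.eye(n), np.zeros((n, n))]])

Q = metropolis_ring(20)
lam, zeta, factor = shift_register_tuning(Q)
moduli = np.sort(np.abs(np.linalg.eigvals(shift_register_matrix(Q, zeta))))
measured = moduli[-2]   # moduli[-1] is the consensus eigenvalue 1
print(f"basic consensus factor: {lam:.4f}")
print(f"factor from (25):       {factor:.4f}")
print(f"measured modulus:       {measured:.4f}")
```

The measured modulus agrees with (25) only up to numerical tolerance, since the optimal ζ places a double eigenvalue on the boundary and repeated eigenvalues are computed less accurately.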

In our comparisons, the shift-register iterations will use the $Q$-matrix optimized for the basic consensus iterations and
the associated $\zeta^\star$ given above. The second acceleration technique that we will compare with is the order-optimal gradient
method developed by Nesterov [13]. While these techniques have an optimal convergence rate, also in the absence of
strong convexity, they are not guaranteed to obtain the best convergence factors. For the case of an objective function
which is strongly convex with modulus $l$ and whose gradient is Lipschitz continuous with constant $u$, the following
iterations are proposed in [13]:

$$\hat{x}(k+1) = x(k) - \nabla f(x(k))/u$$
$$x(k+1) = \hat{x}(k+1) + \frac{\sqrt{u} - \sqrt{l}}{\sqrt{u} + \sqrt{l}}\,\bigl(\hat{x}(k+1) - \hat{x}(k)\bigr)$$

initialized with $\hat{x}(0) = x(0)$. When we apply this technique to the consensus problem, we arrive at the iterations

$$x(k+1) = (I - aW)\bigl(x(k) + b\,(x(k) - x(k-1))\bigr) \tag{26}$$

with parameters $W = AA^T$, $a = \lambda_n^{-1}(W)$ and $b = \bigl(\sqrt{\lambda_n(W)} - \sqrt{\lambda_2(W)}\bigr)/\bigl(\sqrt{\lambda_n(W)} + \sqrt{\lambda_2(W)}\bigr)$.

Fig. 4 compares the multi-step iterations (23) developed in this paper with (a) the basic consensus iterations (21)
with a weight matrix determined using the Metropolis scheme, (b) the shift-register acceleration (24) with the same
weight matrix and the optimal $\zeta$, and (c) the order-optimal method (26). The particular results shown are for a network
of 100 nodes in a dumbbell topology. The simulations show that all three methods yield a significant improvement in
convergence factors over the basic iterations, and that the multi-step method developed in this paper outperforms the
alternatives.



Several remarks are in order. First, since the Hessian of (20) equals the identity matrix, the speed-up of the multi-step
iterations is proportional to $\sqrt{\kappa} = \sqrt{\lambda_n(W)/\lambda_2(W)}$. When $W$ equals $\mathcal{L}$, the Laplacian of the underlying graph, we
can quantify the speed-ups for certain classes of graphs using spectral graph theory [28]. For example, the complete
graph has $\lambda_2(\mathcal{L}) = \lambda_n(\mathcal{L})$, so $\kappa = 1$ and there is no real advantage of the multi-step iterations. On the other hand, for
a ring network the eigenvalues of $\mathcal{L}$ are given by $1 - \cos(2\pi k/|\mathcal{V}|)$, so $\kappa$ grows quickly with the number of nodes,
and the performance improvements of (23) over (22) could be substantial.
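To make the ring example concrete, the following short calculation evaluates κ from the eigenvalue formula above. It assumes an even number of nodes n = |V| and optimally tuned step-sizes with H = I, so that the gradient and multi-step convergence factors are (κ-1)/(κ+1) and (√κ-1)/(√κ+1), respectively; the choice n = 100 is only an example.

```latex
% Ring with n = |V| nodes (n even): \lambda_k = 1 - \cos(2\pi k/n), k = 0, \dots, n-1
\lambda_2 = 1 - \cos(2\pi/n) = 2\sin^2(\pi/n), \qquad \lambda_n = 1 - \cos(\pi) = 2,
\\
\kappa = \frac{\lambda_n}{\lambda_2} = \frac{1}{\sin^2(\pi/n)} \approx \frac{n^2}{\pi^2},
\qquad \sqrt{\kappa} \approx \frac{n}{\pi}.
\\
% Example: n = 100 gives \sqrt{\kappa} \approx 31.8
\frac{\kappa - 1}{\kappa + 1} \approx 0.9980 \ \text{(gradient)}, \qquad
\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \approx 0.9391 \ \text{(multi-step)}.
```

Under these assumptions, reducing the error by a fixed factor takes roughly 30 times fewer multi-step iterations than gradient iterations on a 100-node ring.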

Our second remark pertains to the shift-register iterations. Since these iterations have the same form as (23), we
can go beyond the current literature on shift-register consensus (which assumes $Q$ to be given and optimizes $\zeta$) and
provide a jointly optimal weight matrix and $\zeta$-parameter:

Proposition 6: The weight matrix $Q^\star$ and constant $\zeta^\star$ that minimize the convergence factor of the shift-register
consensus iterations (24) are

$$Q^\star = I - \theta^\star W^\star, \qquad \zeta^\star = 1 + \beta^\star$$

where $W^\star$ is computed in Proposition 5, $\beta^\star$ is given in Theorem 1 with $H = I$ and

$$\theta^\star = \frac{2}{\lambda_2(W^\star) + \lambda_n(W^\star)}.$$

C. Internet congestion control 

Our final application is to the area of Internet congestion control, where Network Utility Maximization (NUM) has
emerged as a powerful framework for studying various important resource allocation problems, see, e.g., [29], [30],
[31]. The vast majority of the work in this area is based on the dual decomposition approach introduced in [29]. Here,
the optimal bandwidth sharing among $S$ flows in a data network is posed as the optimizer of a convex optimization
problem

$$\begin{array}{ll} \text{minimize}_x & -\sum_s u_s(x_s) \\ \text{subject to} & x_s \in [m_s, M_s] \\ & Rx \le c \end{array} \tag{27}$$

In this formulation $x_s$ is the communication rate of flow $s$, and the strictly concave and increasing function $u_s(x_s)$
describes the utility that source $s$ has of communicating at rate $x_s$. The communication rate is restricted to a bounded
interval. Finally, $R \in \{0,1\}^{L \times S}$ is a routing matrix, whose entries $R_{ls}$ equal one if flow $s$ traverses link $l$ and zero
otherwise. In this way, $Rx$ is the total traffic on the links, which cannot exceed the link capacities $c \in \mathbb{R}^L$. We make the
following assumptions.



Assumption 2: For the problem (27) it holds that

(i) Each $u_s(x_s)$ is twice continuously differentiable and satisfies $0 < l \le -\nabla^2 u_s(x_s) \le u$ for $x_s \in [m_s, M_s]$.

(ii) For every link $l$, there exists a source $s$ whose flow only traverses $l$, i.e. $R_{ls} = 1$ and $R_{l's} = 0$ for all $l' \ne l$.

While these assumptions appear restrictive, they are often postulated in the literature (e.g. [29, Assumptions C1-C4]).
Note that under Assumption 2 the routing matrix has full row rank and all link constraints hold with equality at
optimum. Hence, we can replace $Rx \le c$ in (27) with $Rx = c$, introduce Lagrange multipliers $\mu$ for these constraints,
and form the associated dual function

$$d(\mu) = \max_{x_s \in [m_s, M_s]} \Bigl\{ \sum_s \Bigl( u_s(x_s) - x_s \sum_l R_{ls}\mu_l \Bigr) \Bigr\} + \sum_l \mu_l c_l$$

Evaluating $d(\mu)$ amounts to solving an optimization problem in $x$. Since this problem is separable in $x_s$, it can be
solved by each source in isolation based on the sum of the Lagrange multipliers for the links that the flow traverses,

$$x_s^\star(\mu) = \arg\max_{z \in [m_s, M_s]} \Bigl\{ u_s(z) - z \sum_l R_{ls}\mu_l \Bigr\} \tag{28}$$

Similarly, each Lagrange multiplier update

$$\mu_l(k+1) = \mu_l(k) + \alpha \Bigl( \sum_s R_{ls}\, x_s^\star(\mu(k)) - c_l \Bigr) \tag{29}$$

can be carried out by the corresponding link based on local information: if the traffic demand on the link exceeds
capacity, the multiplier is increased, otherwise it is decreased. It is possible to show that
under Assumption 2 the dual function is strongly concave, differentiable and has a Lipschitz-continuous gradient.
Hence, by standard arguments, the updates (28), (29) converge to a primal-dual optimal point $(x^\star, \mu^\star)$ for an appropriately
chosen step-size $\alpha$.
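The rate and price updates can be sketched in a few lines. The example below is illustrative only and rests on several assumptions that are not taken from the experiments reported here: a hypothetical 10-link, 20-flow topology (one single-link flow per link plus one flow per pair of adjacent links), quadratic utilities $u_s(x) = -(M_s - x)^2/2$ with $M_s = 50$, a hand-picked optimal price vector used to generate the capacities, and step-sizes computed from the exact extreme eigenvalues of $RR^T$ with the standard heavy-ball tuning. The momentum term with weight beta is the multi-step modification of (29); beta = 0 recovers (29) itself.

```python
import numpy as np

def ring_routing(num_links):
    """Hypothetical routing matrix (links x flows): flows 0..L-1 use one link each,
    flow L+j uses links j and j+1 (mod L)."""
    eye = np.eye(num_links)
    return np.hstack([eye, eye + np.roll(eye, 1, axis=0)])

def source_rates(R, mu, m, M):
    """Rate update (28) for u_s(x) = -(M - x)^2 / 2: x_s = M - sum_l R_ls mu_l, clipped."""
    return np.clip(M - R.T @ mu, m, M)

def price_iterations(R, c, m, M, mu0, alpha, beta, iters):
    """Link price update with momentum weight beta; beta = 0 is the gradient update (29)."""
    mu_prev, mu = mu0.copy(), mu0.copy()
    for _ in range(iters):
        excess = R @ source_rates(R, mu, m, M) - c      # demand minus capacity
        mu_prev, mu = mu, mu + alpha * excess + beta * (mu - mu_prev)
    return mu

L = 10
R = ring_routing(L)
m, M = 0.0, 50.0
mu_star = np.array([5.0, 6.0, 7.0, 6.0, 5.0, 6.0, 7.0, 6.0, 5.0, 6.0])  # assumed optimum
c = R @ source_rates(R, mu_star, m, M)   # capacities consistent with mu_star
mu0 = np.full(L, 6.0)                    # assumed initial prices

eigs = np.linalg.eigvalsh(R @ R.T)
lam_lo, lam_hi = eigs[0], eigs[-1]
alpha_gd = 2.0 / (lam_lo + lam_hi)
alpha_ms = (2.0 / (np.sqrt(lam_hi) + np.sqrt(lam_lo))) ** 2
beta_ms = ((np.sqrt(lam_hi) - np.sqrt(lam_lo)) / (np.sqrt(lam_hi) + np.sqrt(lam_lo))) ** 2

err_gd = np.linalg.norm(price_iterations(R, c, m, M, mu0, alpha_gd, 0.0, 30) - mu_star)
err_ms = np.linalg.norm(price_iterations(R, c, m, M, mu0, alpha_ms, beta_ms, 30) - mu_star)
print(f"price error after 30 steps, gradient:   {err_gd:.2e}")
print(f"price error after 30 steps, multi-step: {err_ms:.2e}")
```

In this instance the rates stay strictly inside $[m_s, M_s]$, so the dual problem is quadratic with Hessian $RR^T$; in general the projection in (28) makes the dual gradient only piecewise linear.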

Our results from Section V indicate that substantially improved convergence factors could be obtained by the
following class of multi-step updates of the Lagrange multipliers

$$\mu_l(k+1) = \mu_l(k) + \alpha \Bigl( \sum_s R_{ls}\, x_s^\star(\mu(k)) - c_l \Bigr) + \beta\bigl(\mu_l(k) - \mu_l(k-1)\bigr) \tag{30}$$

To tune the step-sizes in an optimal way, we bring the techniques from Section V into action. To do so, we first bound
the eigenvalues of $RR^T$ using the following result:
Lemma 4: Let $R \in \{0,1\}^{L \times S}$ satisfy Assumption 2. Then

$$1 \le \lambda_1(RR^T), \qquad \lambda_n(RR^T) \le l_{\max} s_{\max}$$

where $l_{\max} = \max_s \sum_l R_{ls}$ and $s_{\max} = \max_l \sum_s R_{ls}$.
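The bounds of Lemma 4 are easy to check numerically. The sketch below draws random routing matrices that satisfy Assumption 2(ii) by construction (an identity block for the single-link flows, followed by random multi-link flows) and compares the extreme eigenvalues of $RR^T$ with the bounds. The generator and its parameters are hypothetical and serve only as an illustration.

```python
import numpy as np

def random_routing(num_links, num_extra_flows, rng):
    """Routing matrix with one single-link flow per link plus random extra flows."""
    single = np.eye(num_links, dtype=int)
    extra = rng.integers(0, 2, size=(num_links, num_extra_flows))
    return np.hstack([single, extra])

def lemma4_quantities(R):
    """Return (lambda_1(RR^T), lambda_n(RR^T), l_max * s_max)."""
    eigs = np.linalg.eigvalsh(R @ R.T)
    l_max = R.sum(axis=0).max()   # largest number of links used by one flow
    s_max = R.sum(axis=1).max()   # largest number of flows sharing one link
    return eigs[0], eigs[-1], l_max * s_max

rng = np.random.default_rng(1)
results = [lemma4_quantities(random_routing(10, 10, rng)) for _ in range(20)]
for lam_min, lam_max, bound in results[:3]:
    print(f"lambda_1 = {lam_min:.3f}, lambda_n = {lam_max:.3f}, l_max*s_max = {bound}")
```

The upper bound is typically loose for dense random routing, which matches the later observation that Lemma 4 overestimates the largest eigenvalue.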

The optimal step-size parameters and corresponding convergence factor now follow from Lemma 4 and Theorem 2.
Proposition 7: Consider the network utility maximization problem (27) under Assumption 2. Then, for $0 \le \beta < 1$
and $0 < \alpha < 2(1+\beta)/(u\, l_{\max} s_{\max})$ the iterations (28) and (30) converge linearly to a primal-dual optimal pair. The
step-sizes



a 




+ Vi 

ensure that the convergence factor of the dual iterates is 

<7num = —j= 

Note that an upper bound on the Hessian of the dual function was also derived in [29]. However, strong concavity
was not explored and the associated bounds were not derived.

We now comment on the steady-state behavior of the accelerated link price algorithm (30). Due to the saturation assumption,
as $k \to \infty$, close to the equilibrium, we have $\alpha\bigl(\sum_s R_{ls}\, x_s^\star(\mu(k)) - c_l\bigr) \to 0$. Hence,

$$\begin{aligned} \mu_l(k+1) &= \mu_l(k) + \beta\bigl(\mu_l(k) - \mu_l(k-1)\bigr) \\ \mu_l(k+1) - \mu_l^\star &= \mu_l(k) - \mu_l^\star + \beta\bigl((\mu_l(k) - \mu_l^\star) - (\mu_l(k-1) - \mu_l^\star)\bigr) \\ e_l(k+1) &= e_l(k) + \beta\bigl(e_l(k) - e_l(k-1)\bigr), \end{aligned} \tag{31}$$

where $\mu_l^\star$ is the optimal price of link $l$ and $e_l(k) = \mu_l(k) - \mu_l^\star$ is the distance between the current price and the
optimal price of link $l$. The recursion (31) has the structure of a PD controller that drives the price of link $l$ to its
optimal value. Hence, asymptotically (30) behaves like a PD controller.
To compare the gradient iterations with the multi-step congestion control mechanism, we present representative
results from a network with 10 links and 20 flows which satisfies Assumption 2. The utility functions are of the form
$-(M_s - x_s)^2/2$, with $m_s = 0$ and $M_s = 10^5$ for all sources. As shown in Figure 5, substantial speedups are obtained.

As a final remark, note that Lemma 4 underestimates $\lambda_1$ and overestimates $\lambda_n$, so we have no formal guarantee that
the multi-step method will always outperform the gradient-based algorithm. However, in our experiments with a large
number of randomly generated networks, the disadvantageous situation identified in Section VI never occurred.



VIII. Conclusions 

We have studied accelerated gradient methods for network-constrained optimization problems. In particular, given
bounds on the Hessian of the objective function and the Laplacian of the underlying communication graph, we
derived primal and dual multi-step techniques that significantly improve the convergence factors compared
to the standard gradient-based techniques. We derived optimal parameters and convergence factors, and characterized
the robustness of our tuning rules to errors that occur when critical problem parameters are not known but have to be
estimated. Our multi-step techniques were applied to three classes of problems: distributed resource allocation under
a network-wide resource constraint, distributed average consensus, and Internet congestion control. We demonstrated,
both analytically and in numerical simulations, that the approaches developed in this paper outperform, often
significantly, alternatives from the literature.

Appendix 

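A. Proof of Theorem 1

Let $x^\star$ be the optimizer of (1). The Taylor series expansion of $\nabla f(x(k))$ around $x^\star$ yields

$$\begin{aligned} W\nabla f(x(k)) &= W\bigl(\nabla f(x^\star) + \nabla^2 f(x^\star)(x(k) - x^\star)\bigr) \\ &= W\nabla^2 f(x^\star)(x(k) - x^\star) \end{aligned}$$

since $W\nabla f(x^\star) = WA^T\mu^\star = 0$ by the optimality condition and (8). Introducing

$$z(k) = [x(k) - x^\star,\; x(k-1) - x^\star]^T,$$

we can thus re-write (11) as

$$z(k+1) = \underbrace{\begin{bmatrix} B & -\beta I \\ I & 0 \end{bmatrix}}_{\Gamma} z(k) + o(z(k)^2), \tag{32}$$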



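where $B = (1+\beta)I - \alpha WH$ and $H = \nabla^2 f(x^\star)$. Now, for non-zero vectors $v_1$ and $v_2$, consider the eigenvalue
equation

$$\begin{bmatrix} B & -\beta I \\ I & 0 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \lambda(\Gamma) \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}.$$

Since $v_1 = \lambda(\Gamma)v_2$, the first row can be re-written as

$$\bigl(-\lambda^2(\Gamma)I + \lambda(\Gamma)B - \beta I\bigr)v_2 = 0. \tag{33}$$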
Note that (33) is a polynomial in $B$ and $B$ is in turn a polynomial in $WH$. Hence, if $\mu$ and $\lambda$ denote the eigenvalues
of $B$ and $WH$, respectively, we have

$$\lambda^2(\Gamma) - (1 + \beta - \alpha\lambda)\,\lambda(\Gamma) + \beta = 0. \tag{34}$$

The roots of (34) have the form

$$\lambda(\Gamma) = \frac{1 + \beta - \alpha\lambda \pm \sqrt{\Delta}}{2}, \qquad \Delta = (1 + \beta - \alpha\lambda)^2 - 4\beta. \tag{35}$$
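The relation between the spectrum of $\Gamma$ and that of $WH$ can be verified numerically. The sketch below builds $\Gamma$ for a small example and checks that each root of the quadratic (34), evaluated at the eigenvalues of $H^{1/2}WH^{1/2}$ (which are shared with $WH$), is also an eigenvalue of $\Gamma$. The path-graph Laplacian, the diagonal $H$, and the values of alpha and beta are hypothetical test data, and the function names are not from the paper.

```python
import numpy as np

def path_laplacian(n):
    """Laplacian of a path graph with n nodes (has exactly one zero eigenvalue)."""
    W = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    W[0, 0] = W[-1, -1] = 1.0
    return W

def gamma_matrix(W, H, alpha, beta):
    """Block matrix of (32) with B = (1 + beta) I - alpha W H."""
    n = W.shape[0]
    B = (1.0 + beta) * np.eye(n) - alpha * (W @ H)
    return np.block([[B, -beta * np.eye(n)],
                     [np.eye(n), np.zeros((n, n))]])

def roots_of_34(W, H, alpha, beta):
    """Roots of (34) for every eigenvalue of H^(1/2) W H^(1/2)."""
    H_half = np.sqrt(H)   # elementwise sqrt is valid because H is diagonal
    lams = np.linalg.eigvalsh(H_half @ W @ H_half)
    roots = [np.roots([1.0, -(1.0 + beta - alpha * lam), beta]) for lam in lams]
    return np.concatenate(roots).astype(complex)

W = path_laplacian(5)
H = np.diag([1.0, 1.5, 2.0, 2.5, 3.0])
alpha, beta = 0.2, 0.4
eig_gamma = np.linalg.eigvals(gamma_matrix(W, H, alpha, beta))
roots = roots_of_34(W, H, alpha, beta)

# Largest distance from a root of (34) to the nearest eigenvalue of Gamma.
mismatch = max(np.min(np.abs(eig_gamma - r)) for r in roots)
moduli = np.sort(np.abs(eig_gamma))
print(f"largest mismatch: {mismatch:.2e}")
print(f"two largest eigenvalue moduli: {moduli[-1]:.4f}, {moduli[-2]:.4f}")
```

With these values alpha lies below 2(1+beta) divided by the largest eigenvalue of $WH$, so a single eigenvalue equals 1 (from the zero eigenvalue of $W$) and all others lie strictly inside the unit circle.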

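If $\Delta \ge 0$, then $|\lambda(\Gamma)| < 1$ is equivalent to

$$(1 + \beta - \alpha\lambda)^2 - 4\beta \ge 0$$
$$-2 < 1 + \beta - \alpha\lambda \pm \sqrt{(1 + \beta - \alpha\lambda)^2 - 4\beta} < 2,$$

which, after simplifications, yield

$$0 < \alpha < 2(1 + \beta)/\lambda.$$

On the other hand, if $\Delta < 0$, then $|\lambda(\Gamma)| < 1$ is equivalent to

$$0 \le \frac{(1 + \beta - \alpha\lambda)^2 - \Delta}{4} < 1,$$

which, after similar simplifications, implies that $0 \le \beta < 1$.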

Note that the upper bound on $\alpha$ depends on the eigenvalue $\lambda$. Here we find an upper bound for this eigenvalue.
Since $H$ is a positive diagonal matrix, under similarity we have $WH \sim H^{1/2}WHH^{-1/2} = H^{1/2}WH^{1/2}$.
Without loss of generality assume $x \in \mathbb{R}^n$ and $x^Tx = 1$. Then $x^TH^{1/2}WH^{1/2}x = y^TWy$, where $y = H^{1/2}x$.
Clearly, for $y^TWy$ it holds that

$$\lambda_1(W)\,y^Ty \le y^TWy \le \lambda_n(W)\,y^Ty.$$

Now, $l \le y^Ty = x^THx \le u$ implies $l\lambda_1(W) \le x^TH^{1/2}WH^{1/2}x \le u\lambda_n(W)$, and hence a sufficient condition on $\alpha$ reads

$$0 < \alpha < \frac{2(1+\beta)}{u\lambda_n(W)}.$$

Having proven the sufficient conditions for convergence stated in the theorem, we now proceed to estimate the
convergence factor. To this end, we need the following lemmas describing the eigenvalue characteristics of $WH$ and
$\Gamma$.

Lemma 5: If $W$ has $m < n$ zero eigenvalues, then $WH$ has exactly $n - m$ nonzero eigenvalues, i.e. $\lambda_1(WH) =
\cdots = \lambda_m(WH) = 0$, $\lambda_i(WH) \ne 0$, $i = m+1, \dots, n$.

Proof: From [21] we know that a matrix is positive semidefinite if and only if all its principal submatrices have nonnegative
determinants. Note that the $i$-th principal submatrix of $WH$, $WH_i$, is obtained
by multiplying the corresponding principal submatrix of $W$, $W_i$, by the same principal submatrix of $H$, $H_i$, from
the right, and we have $\det(WH_i) = \det(W_i)\det(H_i)$. We know $\det(H_i) > 0$ and $\det(W_i) \ge 0$ because $W \succeq 0$, thus
$\det(WH_i) \ge 0$ and $WH$ is positive semidefinite. Furthermore, $\mathrm{rank}(WH) = \mathrm{rank}(W)$. So $\mathrm{rank}(WH) = n - m$, which
means that $WH$ has exactly $m$ zero eigenvalues. ■
Lemma 6: For any $WH$ such that $\lambda_i(WH) = 0$ for $i = 1, \dots, m$, and $\lambda_i(WH) \ne 0$ for $i = m+1, \dots, n$, the
matrix $\Gamma$ has $m$ eigenvalues equal to 1 and the absolute values of the remaining $2n - m$ eigenvalues are strictly less
than 1.

Proof: For complex $\lambda_i(\Gamma)$ we have $|\lambda_i(\Gamma)| = \sqrt{\beta} < 1$. For real-valued $\lambda_i(\Gamma)$, on the other hand, the upper bound
$2(1+\beta)/\lambda$ on $\alpha$ is a decreasing function of $\lambda$. In this case, $0 < \alpha < 2(1+\beta)/\overline{\lambda}$ guarantees that $0 < \alpha < 2(1+\beta)/\lambda_i(WH)$
for any $0 < \lambda_i(WH) \le \overline{\lambda}$. Note that imposing a tighter bound on $\alpha$ does not change the sufficient condition for
having $|\lambda(\Gamma)| < 1$. Only when $\lambda_i(WH) = 0$ does the upper bound on $\alpha$ tend to infinity. For this case, if we substitute $\lambda_i(WH) = 0$ in
(34) we obtain $\lambda_{2i-1}(\Gamma) = 1$ and $\lambda_{2i}(\Gamma) = \beta < 1$. ■
We are now ready to prove the remaining parts of Theorem 1. By the lemmas above, $\Gamma$ has $m < n$ eigenvalues equal
to 1, which correspond to the $m$ zero eigenvalues of $W$ implied by the optimality condition (8). Hence, minimizing the
$(m+1)$-th largest eigenvalue of (32) leads to the optimum convergence factor of the multi-step weighted gradient
iterations (11). Calculating $\Lambda_\Gamma = \min_{\alpha,\beta} \max_{1 \le j \le 2n-m} |\lambda_j(\Gamma)|$ yields the optimum $\alpha^\star$ and $\beta^\star$. Considering that (35) are the
eigenvalues of $\Gamma$,

$$\Lambda_\Gamma = \frac{1}{2}\max_i \Bigl\{ \Bigl|\, |1 + \beta - \alpha\lambda_i| + \sqrt{(1 + \beta - \alpha\lambda_i)^2 - 4\beta}\, \Bigr| \Bigr\},$$

where $\lambda_i = \lambda_i(WH)$, $\forall i = m+1, \dots, n$. There are two cases:

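Case 1: $(1 + \beta - \alpha\lambda_i)^2 - 4\beta \ge 0$. Then, $a$ and $b$ are non-negative and real with $a \ge b$. Hence, $a^2 - b^2 \ge (a - b)^2$
and consequently $a + \sqrt{a^2 - b^2} \ge 2a - b \ge b$.

Case 2: $(1 + \beta - \alpha\lambda_i)^2 - 4\beta < 0$. In this case, $\lambda_i(\Gamma)$ is complex-valued. Consider $c, d \in \mathbb{R}_+$ with $c^2 < d$. Then,
$|c + \sqrt{c^2 - d}| = \sqrt{c^2 - c^2 + d} = \sqrt{d} \ge 2c - \sqrt{d}$.

If we substitute these results into $\Lambda_\Gamma$ with $a = |1 + \beta - \alpha\lambda_i|$, $b = 2\sqrt{\beta}$, $c = |1 + \beta - \alpha\lambda_i|$ and $d = 4\beta$ we get

$$\Lambda_\Gamma \ge \max\Bigl\{\sqrt{\beta},\; \max_i\bigl\{|1 + \beta - \alpha\lambda_i| - \sqrt{\beta}\bigr\}\Bigr\},$$

which can be expressed in terms of $\underline{\lambda}$ and $\overline{\lambda}$:

$$\Lambda_\Gamma \ge \max\Bigl\{\sqrt{\beta},\; |1 + \beta - \alpha\underline{\lambda}| - \sqrt{\beta},\; |1 + \beta - \alpha\overline{\lambda}| - \sqrt{\beta}\Bigr\}. \tag{37}$$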



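It can be verified that

$$\max\bigl\{|1 + \beta - \alpha\underline{\lambda}| - \sqrt{\beta},\; |1 + \beta - \alpha\overline{\lambda}| - \sqrt{\beta}\bigr\} \ge |1 + \beta - \alpha'\underline{\lambda}| - \sqrt{\beta}, \tag{38}$$

where $\alpha'$ is such that $|1 + \beta - \alpha'\underline{\lambda}| = |1 + \beta - \alpha'\overline{\lambda}|$, i.e.

$$\alpha' = \frac{2(1 + \beta)}{\underline{\lambda} + \overline{\lambda}}. \tag{39}$$

From (37), (38) and (39), we thus obtain

$$\Lambda_\Gamma \ge \max\Bigl\{\sqrt{\beta},\; (1 + \beta)\frac{\overline{\lambda} - \underline{\lambda}}{\overline{\lambda} + \underline{\lambda}} - \sqrt{\beta}\Bigr\}. \tag{40}$$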

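Again, the max-operator can be bounded from below by its value at the point where the arguments are equal. To this
end, consider $\beta'$ which satisfies

$$\sqrt{\beta'} = (1 + \beta')\frac{\overline{\lambda} - \underline{\lambda}}{\overline{\lambda} + \underline{\lambda}} - \sqrt{\beta'},$$

that is,

$$\beta' = \left(\frac{\sqrt{\overline{\lambda}} - \sqrt{\underline{\lambda}}}{\sqrt{\overline{\lambda}} + \sqrt{\underline{\lambda}}}\right)^2. \tag{41}$$

Since

$$\max\Bigl\{\sqrt{\beta},\; (1 + \beta)\frac{\overline{\lambda} - \underline{\lambda}}{\overline{\lambda} + \underline{\lambda}} - \sqrt{\beta}\Bigr\} \ge \sqrt{\beta'}, \tag{42}$$

we can combine (42) and (40) to conclude that

$$\Lambda_\Gamma \ge \sqrt{\beta'} = \frac{\sqrt{\overline{\lambda}} - \sqrt{\underline{\lambda}}}{\sqrt{\overline{\lambda}} + \sqrt{\underline{\lambda}}}. \tag{43}$$

Our proof is concluded by noting that equality in (43) is attained for the smallest non-zero eigenvalue of $\Gamma$ and the
optimal step-sizes $\beta^\star$ and $\alpha^\star$ stated in the body of the theorem.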

B. Proof of Proposition 1

As shown in the proof of Theorem 1, the eigenvalues of $WH$ are equal to those of $H^{1/2}WH^{1/2}$. According to [21,
p. 225], for matrices $W$ and $H^{1/2}WH^{1/2}$ there exists a nonnegative real number $\theta_k$ such that $\lambda_1(H) \le \theta_k \le \lambda_n(H)$
and $\lambda_k(H^{1/2}WH^{1/2}) = \theta_k\lambda_k(W)$. Letting $k = m+1$ and $k = n$ yields $\underline{\lambda} \ge l\,\underline{\lambda}_W$ and $\overline{\lambda} \le u\,\overline{\lambda}_W$. The rest of the
proof is similar to that of Theorem 1 and is omitted for brevity.

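C. Proof of Proposition 2

Direct calculations yield $q^\star = (\sqrt{\overline{\lambda}} - \sqrt{\underline{\lambda}})/(\sqrt{\overline{\lambda}} + \sqrt{\underline{\lambda}}) = 1 - 2/((\overline{\lambda}/\underline{\lambda})^{1/2} + 1)$. Similarly, $q = 1 - 2/((\overline{\lambda}_W/\underline{\lambda}_W)^{1/2} + 1)$.
Hence, minimizing $q^\star$ and $q$ is equivalent to minimizing the condition number of $WH$ and $W$, respectively.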

D. Proof of Proposition 3

Similar to the proof of Theorem 1, it can be seen that the eigenvalues of $WH$ are equal to the ones of $\Omega = H^{1/2}WH^{1/2}$.
To have the $m$ zero eigenvalues of $\Omega$ corresponding to the condition $WA^T = 0$ in (8), one needs to condition $V$
in (12) to belong to the kernel of $WH^{1/2}$. Moreover, to restrict the search for $W$ to the nonzero eigenspace of $W$, we
should have $x^T\Omega x > 0$ for all nonzero $x \in V$. This condition is equivalent to having $y^TP^T\Omega Py > 0$ for all nonzero
vectors $y$ of compatible dimension, with $P$ being the matrix of vectors spanning $V$.

E. Proof of Lemma 1

To prove (a) we exploit the equivalence of $l$-strong convexity of $f(\cdot)$ and $1/l$-Lipschitz continuity of $\nabla f^\star$. Specifically,
according to [32, Theorem 4.2.1], for nonzero $z_1, z_2 \in \mathbb{R}^n$, Lipschitz continuity of $\nabla f^\star$ implies that

$$\langle \nabla f^\star(z_1) - \nabla f^\star(z_2),\, z_1 - z_2 \rangle \le \frac{1}{l}\|z_1 - z_2\|^2.$$

Now, for $-\nabla d(z) = -A\nabla f^\star(-A^Tz) + b$, we can rewrite the inner product as

$$\langle -\nabla d(z_1) + \nabla d(z_2),\, z_1 - z_2 \rangle = \langle \nabla f^\star(-A^Tz_1) - \nabla f^\star(-A^Tz_2),\, -A^T(z_1 - z_2) \rangle.$$

In light of the $1/l$-Lipschitzness of $\nabla f^\star$ we get

$$\langle \nabla f^\star(-A^Tz_1) - \nabla f^\star(-A^Tz_2),\, -A^T(z_1 - z_2) \rangle \le \frac{1}{l}\|-A^T(z_1 - z_2)\|^2 \le \frac{\lambda_n(AA^T)}{l}\|z_1 - z_2\|^2.$$

(b) According to [32, Theorem 4.2.2], if $\nabla f(\cdot)$ is $u$-Lipschitz continuous then $f^\star$ is $1/u$-strongly convex, i.e., for
non-identical $z_1, z_2 \in \mathbb{R}^n$

$$\langle \nabla f^\star(z_1) - \nabla f^\star(z_2),\, z_1 - z_2 \rangle \ge \frac{1}{u}\|z_1 - z_2\|^2.$$

One can manipulate the above inequality as

$$\begin{aligned} \langle -\nabla d(z_1) + \nabla d(z_2),\, z_1 - z_2 \rangle &= \langle \nabla f^\star(-A^Tz_1) - \nabla f^\star(-A^Tz_2),\, -A^T(z_1 - z_2) \rangle \\ &\ge \frac{1}{u}\|-A^T(z_1 - z_2)\|^2 \ge \frac{\lambda_1(AA^T)}{u}\|z_1 - z_2\|^2. \end{aligned}$$

It is worth noting that here we assume that $A$ has full row rank.

F. Proof of Theorem 2

The result follows from Lemma 1 and Theorem 1 with $W = I$ and noting that $(\lambda_1(AA^T)/u)I \le H \le (\lambda_n(AA^T)/l)I$.

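G. Proof of Lemma 2

Since $f$ is twice differentiable on $[x^\star, x]$, we have

$$\begin{aligned} \nabla f(x) &= \nabla f(x^\star) + \int_0^1 \nabla^2 f(x^\star + \tau(x - x^\star))(x - x^\star)\,d\tau \\ &= A^T\mu^\star + H(x)(x - x^\star), \end{aligned}$$

where we have used the fact that $\nabla f(x^\star) = A^T\mu^\star$ and introduced $H(x) = \int_0^1 \nabla^2 f(x^\star + \tau(x - x^\star))\,d\tau$. By virtue of
Assumption 1, $H(x)$ is symmetric and nonnegative definite and satisfies $lI \le H(x) \le uI$ [14]. Hence from (7) and (8)

$$\begin{aligned} \|x(k+1) - x^\star\| &= \|x(k) - x^\star - \alpha W\nabla f(x(k))\| \\ &= \|x(k) - x^\star - \alpha W(A^T\mu^\star + H(x(k))(x(k) - x^\star))\| \\ &= \|(I - \alpha WH(x(k)))(x(k) - x^\star)\| \\ &\le \|I - \alpha WH(x(k))\|\,\|x(k) - x^\star\|. \end{aligned}$$

The rest of the proof follows the same steps as [14, Theorem 3]. Essentially, for a fixed step-size $0 < \alpha < 2/\overline{\lambda}$, the
iterations in (7) converge linearly with factor $q_G = \max\{|1 - \alpha\underline{\lambda}|, |1 - \alpha\overline{\lambda}|\}$. The minimum convergence factor
$q_G^\star = (\overline{\lambda} - \underline{\lambda})/(\overline{\lambda} + \underline{\lambda})$ is obtained by minimizing $q_G$ over $\alpha$, which yields the optimal step-size $\alpha^\star = 2/(\overline{\lambda} + \underline{\lambda})$.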

H. Proof of Proposition [5] 

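According to Lemma 2, the weighted gradient iterations (7) with estimated step-size $\hat{\alpha} = 2/(\hat{\underline{\lambda}} + \hat{\overline{\lambda}})$ will converge
provided that $0 < \hat{\alpha} < 2/\overline{\lambda}$, i.e. when $\overline{\lambda} < \hat{\underline{\lambda}} + \hat{\overline{\lambda}}$.

For the multi-step algorithm (11), Theorem 1 guarantees convergence if $0 \le \hat{\beta} < 1$ and $0 < \hat{\alpha} < 2(1 + \hat{\beta})/\overline{\lambda}$. The
assumption $0 < \hat{\underline{\lambda}} \le \hat{\overline{\lambda}}$ implies that the condition on $\hat{\beta}$ is always satisfied. Regarding $\hat{\alpha}$, inserting the expression for $\hat{\beta}$
in the upper bound for $\hat{\alpha}$ and simplifying yields

$$\frac{4}{\bigl(\sqrt{\hat{\overline{\lambda}}} + \sqrt{\hat{\underline{\lambda}}}\bigr)^2} < \frac{4(\hat{\underline{\lambda}} + \hat{\overline{\lambda}})}{\bigl(\sqrt{\hat{\overline{\lambda}}} + \sqrt{\hat{\underline{\lambda}}}\bigr)^2}\,\frac{1}{\overline{\lambda}},$$

which is satisfied if $0 < \overline{\lambda} < \hat{\underline{\lambda}} + \hat{\overline{\lambda}}$. The statement is proven.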
I. Proof of Lemma 3

We consider two cases. First, when A + A < A + A combined with the assumption that < A < A + A yields 5A > 1, 
which means that |1 — 5A| = 5A — 1. Moreover, aX — 1 > 1 — 5A, so by Lemma [5] 

qc = max{5A — 1, max{l — 5A, aX — 1}} = aX — 1 
= 2A/(A + A) - 1. 

The second case is when A + A > A + A. Then, 5A < 1 and hence II — aX\ = 1 — aX. Moreover, 1 — aX > aX — 1, 



so 



qa = max{l — aX, max{aA — 1, 1 — aX}} = 1 — aA 
The convergence factor of the multi-step iterations with inaccurate step-sizes ( fT~6] > follows directly from Theorem [T] 

J. Proof of Proposition [5] 

We analyze the four quadrants Qi through Q4 in order. 
Qi : when (g, e) £ Qi we have A > A and A > A > A. From convergence factor of multi-step weighted gradient 
method given in ( fT8] l it then follows that 

q = l + (3-aX-(3 1/2 . 

Moreover, since in this quadrant A + A > A + A, from ( fT7] > we have qo = 1 — 2A/ (A + A). A direct comparison 
between the two expressions yields that q< q~Q. 
Qi : when (§, e) € Q2 we have A < X and A < A. Combined with the stability assumption A + A > A, straightforward 
calculations show that the convergence factor of the multi-step iterations with inaccurate step-sizes ( fT~6] > is 

SA - $ - 1 - A + A < A + A, 
1 + P — a X — y/3 otherwise, 
Moreover, for this quadrant the convergence factor of weighted gradient method is given by ([17). To verify that 
q < qc we perform the following comparisons: 

(a) If A + A < A + A then we have q = aX-/3-l- /3 1/2 and q G = (2A)/(A + A) - 1. To show that q < q G 
we rearrange it to obtain the following inequality 

A = ( A - A + A 1 / 2 A 1 / 2 ) (A + A) - 2AA 1 / 2 A 1 / 2 < 0. 

Further simplifications yield 

A = (A + A - 2(AA) 1 / 2 )A - (A - (AA) 1 / 2 )(A + A) 
= (A 1 / 2 - A 1 / 2 ) 2 ! - A 1 / 2 (A 1//2 - A^XA + A) 
= (AV2 _ y/aj (p./* _ xU^J - X^(X + A)) 



(A i/ 2 _ ^i/aj (_ A i/2 (A + A - A) - A 1 / 2 !) 



< 
Note that the negativity of above quantity comes from the stability condition, A + A > A.

(b) If A + A > A + A then we have q= l + P-aX-ifi) 1 / 2 and q G = 1-(2A)/(A+A). After some simplifications, 
we see that q<q G boils down to the inequality - (A + A) A 1 / 2 A 1 / 2 + 2AA 1 / 2 A 1 / 2 - A(A + A) < or equivalently 
-(A + A - 2A)A 1/2 A 1 / 2 - A(A + A) < which holds by noting that A + A>A + A>2A. 

(c) for the case A + A = A + A, we have q = 1 + (3 — aX — (fi) 1 / 2 and q G = (A — A)/(A + A) which coincides 
with the optimal convergence factor of unperturbed gradient method. After some rearrangements we notice that 
q <q G reduces to checking that 

(AV3_AVa)(A_A) < (X 1 ' 2 + X 1 ' 2 ){X + X) 

that holds since A 1 / 2 - A 1 / 2 < A 1 / 2 + A 1 / 2 and A-A<A + A = A + A. 

Q3 : if (e, e) G Q3 we have < A < A and A < A. Combined with the stability assumption A + A > A, one can verify 
that the convergence factors of the two perturbed iterations are q G — (2A)/(A+A) — 1 and q = aX — f3 — 1 — (Z?) 1 / 2 , 
respectively. The fact that q <q G was proven in step (a) of the analysis of Q 2 - 

Q 4 : if (e, e) € Q 4 then, ([18]) implies that q = /3 1/2 . On the other hand, for this region (|T7) yields q G = (A- A) /(A+A) . 
To conclude, we need to verify that there exists A and A such that q > q G , i.e. such that (A 1 / 2 — A 1 / 2 )/(A 1 ^ 2 + 
A 1/2 ) > (A-A)/(A + A). We do so by multiplying both sides with (A + A)(A 1/2 + A 1/2 ) and simplifying to find 
that the inequality holds if AA 1 / 2 > AA 1 / 2 , or equivalently A/A > \ 2 /X 2 . The statement is proven. 

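K. Proof of Proposition 6

The iterations (23) and (24) are equivalent when

$$\begin{aligned} 1 - \zeta &= -\beta \\ (1 + \beta)I - \alpha W &= \zeta Q \end{aligned}$$

The first condition implies that $\zeta^\star = 1 + \beta^\star$. Combining this expression with the second condition, we find

$$Q^\star = I - \frac{\alpha^\star}{1 + \beta^\star}W^\star = I - \frac{2}{\underline{\lambda} + \overline{\lambda}}W^\star.$$

Noting that for the consensus case, $\underline{\lambda} = \lambda_2(W^\star)$ and $\overline{\lambda} = \lambda_n(W^\star)$ concludes the proof.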
L. Proof of Lemma 4

For the upper bound on $\lambda_n(RR^T)$, we use a similar approach as [29, Lemma 3]. Specifically, from [21, p. 313],

$$\lambda_n^2(RR^T) = \|RR^T\|_2^2 \le \|RR^T\|_\infty \|RR^T\|_1 = \|RR^T\|_\infty^2.$$

Hence,

$$\lambda_n(RR^T) \le \max_l \sum_{l'} [RR^T]_{ll'} = \max_l \sum_{l'} \sum_s R_{ls}R_{l's} \le \max_l \sum_s R_{ls}\, l_{\max} = l_{\max} s_{\max}.$$

To find a lower bound on $\lambda_1(RR^T)$ we consider the definition $\lambda_1(RR^T) = \min_{\|x\|_2 = 1} \|R^Tx\|_2^2$. We have

$$[R^Tx]_s = \sum_{l=1}^{L} R^T_{sl}x_l = \sum_{l=1}^{L} R_{ls}x_l.$$

According to Assumption 2, $R^T$ has $L$ independent rows that have only one non-zero (equal to 1) component. Hence,
after ordering the flows so that these $L$ single-link flows come first,

$$\|R^Tx\|_2^2 = \sum_{l=1}^{L} x_l^2 + \sum_{s=L+1}^{S} \Bigl( \sum_{l=1}^{L} R_{ls}x_l \Bigr)^2 = 1 + \sum_{s=L+1}^{S} \Bigl( \sum_{l=1}^{L} R_{ls}x_l \Bigr)^2 \ge 1,$$

where the last equality is due to $\|x\|_2 = 1$.